CANS - Corpus of American Nordic Speech


CANS - Corpus of American Nordic Speech v.3.1 (latest version, published 27 January 2021) consists of interviews and conversations with 246 American Norwegian informants from 57 locations in the USA and Canada, in total almost 773 000 words. CANS v.3.1 includes recordings and transcriptions from Janne Bondi Johannessen et al. (2010 - 2016) together with older recordings and transcriptions from Didrik Arup Seip and Ernst W. Selmer (1931), Einar Haugen (1942) and Arnstein Hjelde (1987, 1990, 1992).

In September 2017 the corpus was extended with American Swedish: nearly 46 000 words spoken by 22 informants from seven locations in the USA. The Swedish recordings were collected by Ida Larsson et al. (2011 - 2014).

The corpus is freely available for research. Log in with Feide or Clarin. (Contact us if you need another login alternative.)

The interviews and conversations in the corpus are transcribed in two ways: a phonetic transcription and an orthographic transcription. The two transcriptions are linked to each other and to the original audio and video files.

Download the transcriptions
The transcriptions can be downloaded, some in HTML format and some in plain-text format.


Please refer to the corpus with this reference:
Johannessen, Janne Bondi. 2015. The Corpus of American Norwegian Speech (CANS). In Beáta Megyesi (ed.): Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. NEALT Proceedings Series 23.

Please also add the corpus URL:

CANS - Corpus of American Nordic Speech v.3.1: https://tekstlab.uio.no/norskiamerika/english/corpus.html


Tools

Phonetic transcription: The first recordings were transcribed with Transcriber. At present we use the transcription tool ELAN.

Orthographic transcription: The Oslo Transliterator, a semi-automatic dialect transliterator developed at the Text Laboratory, is used to produce orthographic transcriptions from the phonetic transcriptions, for both Norwegian and Swedish. The orthographic transcriptions are proofread against the audio files.

Morphosyntactic tagging, Norwegian: The transcriptions are tagged with morphosyntactic categories by a statistical tagger (TreeTagger) first developed for the NoTa-Oslo corpus. The tagger achieved a performance of 96.9 % in 10-fold cross-validation.
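For readers unfamiliar with the evaluation method, the following is a minimal sketch of 10-fold cross-validation for a tagger. It is not the CANS evaluation setup: the most-frequent-tag baseline stands in for TreeTagger, the toy data is invented, and the sketch assumes that "performance" means per-token accuracy, which the text above does not state explicitly.

```python
from collections import Counter, defaultdict

def train_baseline(sentences):
    """Stand-in tagger (NOT TreeTagger): most frequent tag per word form."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    default_tag = overall.most_common(1)[0][0]  # used for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag

def accuracy(model, sentences):
    """Share of tokens whose predicted tag equals the gold tag."""
    lexicon, default_tag = model
    correct = total = 0
    for sentence in sentences:
        for word, tag in sentence:
            correct += lexicon.get(word, default_tag) == tag
            total += 1
    return correct / total

def cross_validate(sentences, k=10):
    """Train on k-1 folds, test on the held-out fold, average over k runs."""
    if len(sentences) < k:
        raise ValueError("need at least k sentences")
    folds = [sentences[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(train_baseline(training), folds[i]))
    return sum(scores) / k

# Hypothetical toy data: 20 copies of one tagged sentence (tags are invented).
toy = [[("vi", "pron"), ("solgte", "verb"), ("resten", "noun")]] * 20
mean_accuracy = cross_validate(toy, k=10)
print(f"mean accuracy over 10 folds: {mean_accuracy:.3f}")
```

Because every held-out fold is tagged by a model that never saw it during training, the averaged score estimates how the tagger behaves on new transcriptions rather than on the data it was trained on.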

Morphosyntactic tagging, Swedish: The Swedish tagger is a TnT tagger, see Kokkinakis (2003). The tagger is trained on the Swedish PAROLE corpus and on manually tagged orthographic Övdalian transcriptions from the Nordic Dialect Corpus.

The technical solutions were originally developed for The Nordic Dialect Corpus and financed by NorDiaSyn and NordForsk.

Search tool: CANS is searchable through Glossa, a search tool developed at the Text Laboratory. Glossa offers a modern, user-friendly and functional interface. The work is financed by the CLARINO project.


More about the transcriptions

Phonetic transcription: A phonetic transcription makes the dialect features clearly visible in writing, whether they are phonological, morphological, syntactic or lexical. A written representation of speech also helps the linguist get a quick overview of the material.

The phonetic transcription standard is based on Papazian and Helleland's Norsk talemål. Lokal og sosial variasjon (2005), but with no special characters, only the Norwegian/Swedish alphabet. Also, the transcription is quite broad. The transcription standard in CANS is basically the same as that used for Norwegian in the Nordic Dialect Corpus. The standard is also used for the American Swedish transcription.

Orthographic transcription: An orthographic transcription is important because it generalizes over all the variation. It makes general searches possible, as well as automated methods such as grammatical tagging. Producing the orthographic transcription is much faster than producing the phonetic one, thanks to the semi-automatic dialect transliterator, which converts the phonetic transcription to Norwegian Bokmål or Swedish orthography.


The two transcriptions exemplified

Phon.: d e haRd tu finn
Orthogr.: det er hard to finne
Transl.: it is hard to find

Phon.: vi sellt ri å rennta ut resst'n
Orthogr.: vi solgte noe av det og renta ut resten
Transl.: we sold some of it and let out the rest
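As an illustration of how the two transcription layers pair up, the sketch below parses the labelled layout used in the two examples above into records. It handles only that Phon./Orthogr./Transl. layout; the format of the downloadable files is not described here, and the function name `parse_examples` is hypothetical.

```python
EXAMPLES = """\
Phon.: d e haRd tu finn
Orthogr.: det er hard to finne
Transl.: it is hard to find

Phon.: vi sellt ri å rennta ut resst'n
Orthogr.: vi solgte noe av det og renta ut resten
Transl.: we sold some of it and let out the rest
"""

# Map the labels used on this page to record keys.
LABELS = {"Phon.": "phonetic", "Orthogr.": "orthographic", "Transl.": "translation"}

def parse_examples(text):
    """Group labelled lines into one dict per example."""
    records, current = [], {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = LABELS.get(label.strip())
        if not sep or key is None:
            continue  # skip blank or unlabelled lines
        if key in current:  # a repeated label starts a new example
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records

records = parse_examples(EXAMPLES)
for record in records:
    n_phon = len(record["phonetic"].split())
    n_orth = len(record["orthographic"].split())
    print(f"{n_phon} phonetic tokens, {n_orth} orthographic tokens")
```

Counting tokens this way shows that the two layers need not match word for word: in the second example the orthographic line has more tokens than the phonetic one.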


Sweet welcome for the Norwegian researchers in Blair. Photo: K. M. Eide.


Search the corpus



Janne and Signe with informants in Sunburg.



Contact:
tekstlab-post@iln.uio.no