Tekstlab - NoTa

Norwegian Speech Corpora

The Norwegian Speech Corpora below are a collection of several subcorpora, hosted and partly or fully developed at the Text Laboratory, UiO, sometimes in cooperation with others. Some corpora are still under development but can already be used.

	NoTa-Oslo Norsk talespråkskorpus-Oslo-delen	Info
	TAUS Talemålsundersøkelsen i Oslo	Info
	BigBrother	Info
	Nordisk dialektkorpus - Scandinavian Dialect Corpus	Info
	UPUS Utviklingsprosesser i urbane språkmiljøer	Info

NoTa-Oslo [Norsk talespråkskorpus-Oslo-delen] Oslo speech 2005 [Homepage] [Search]
- A corpus of orthographically transcribed speech with linked audio and video files
- Informants carefully selected w.r.t. sociolinguistic variables
- Time of recording: 2005
- Place of recording: Oslo and Oslo area
- Number of informants: 166
- Number of words: approx. 900 000
- Type of material: Interviews and conversations
- Status: Finished

TAUS [Talemålsundersøkelsen i Oslo] Oslo speech from the 1970s [Homepage] [Search]
- Originally a corpus of phonologically transcribed speech with non-linked sound files
- Transcribed orthographically with linked audio files in 2006-2007
- Informants carefully selected w.r.t. sociolinguistic variables
- Time of recording: 1970-1975
- Place of recording: Oslo (Frogner and Vålerenga)
- Number of informants: 59
- Number of words: approx. 244 000
- Type of material: Interviews
- Status: Finished

Big Brother [TV show] Speech of young adults [Homepage] [Search]
- A corpus of orthographically transcribed speech with linked audio and video files
- The informants are 10 young adults from several places in Norway
- Time of recording: 2001
- Number of words: approx. 550 000
- Type of material: Many different kinds of situations in the BigBrother house
- Status: Finished

Nordisk dialektkorpus - Scandinavian Dialect Corpus [Homepage] [Search]
The Nordic Dialect Corpus is a corpus of Norwegian, Swedish, Danish, Faroese and Övdalian (and soon Icelandic and Finland Swedish) spoken language. It consists of spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries. The linguistic data in the corpus come from a variety of sources, both old and new. The corpus contains nearly 2 million words from conversations between dialect speakers. It is transcribed and linked to audio and video, has a map function, and can be searched in a large variety of ways.

The Nordic Dialect Corpus and database are being developed in cooperation with our partners in the Nordic network ScanDiaSyn and the Nordic Center of Excellence, NORMS. The corpus is already available for research.

UPUS [Utviklingsprosesser i urbane språkmiljøer] [Homepage] [Search]
- Corpus under development in the UPUS project. Project leader: Brit Mæhlum, INL, NTNU

Multimedia representation of corpora
Because all the speech is transcribed, it is searchable. In time, all the corpora will be linked to sound and (in some cases) video files. The corpora are in the process of being grammatically tagged. Search results are presented as concordances, where each line can be clicked to play the corresponding sound or video file. The files can also be downloaded and played individually.

Multiple search options
The speech corpora are or will be searchable by words, strings of words, parts of words, grammatical tags, and events.

Permission
Fill in this form to get permission.

Research options
The way the corpora are or will be represented, with high-quality transcriptions, tagging, other annotation, and video and sound files, makes them useful for many kinds of linguistic research: syntax, morphology, phonology, phonetics, semantics, lexicography, language technology and computational linguistics, discourse analysis, sociolinguistics, etc. Given the speech modality, and the fact that the corpora have been recorded in different situations and with different people, they are also useful for studies of language in particular settings or of particular types, such as emotive situations. For this reason, they are also useful for studies in artificial intelligence, as well as psychology and sociology.

Other speech corpora

In the press:

The NoTa-Oslo-project and The Big Brother-corpus

The UPUS project

The ScanDiaSyn project

Contact