The Oslo Corpus of Tagged Norwegian Texts
(bokmål and nynorsk parts)

The bokmål part of the Oslo Corpus contains about 18.5 million words, while the nynorsk part contains about 3.8 million words. The corpora have been coded according to the IMS Corpus Workbench standard (Institut für Maschinelle Sprachverarbeitung, University of Stuttgart). The search interface has been developed at the Text Laboratory.

Contents of the corpus
- Survey of grammatical tags
- Survey of source annotations
Available search methods
Login
Technical information
Frequency lists
Publications
Version
Planned improvements
Users of the corpus
How to contact us

[Search the bokmål corpus] [Search the nynorsk corpus] [Text Laboratory home page]

Contents of the corpus

The corpus consists of the texts that were available at the Text Laboratory in January 1999. It is composed of texts from three genres: fiction (bokmål: 1.7 mill. words; nynorsk: 2.1 mill.), newspapers/magazines (bokmål: 9.6 mill.; nynorsk: 1 mill.), and factual prose (bokmål: 7.1 mill.; nynorsk: 700,000). All fiction comes from ECI (European Corpus Initiative) and Norsk Tekstarkiv (Norwegian Text Archive), Bergen (now: HIT-senteret). The texts from newspapers and magazines have been collected by the Text Laboratory with kind permission from the various editorial offices. The factual prose consists mainly of NOU reports (Norwegian Official Reports) and Norwegian laws and regulations. A detailed survey of the texts, with source annotation codes, is given here.

The corpus is not meant to be representative in any sense, although it contains texts from a variety of genres. The main purpose of the corpus is to offer a large amount of text which researchers can use for searching. However, since it is possible to restrict the search to specific sources, the corpus can be used as a tailored corpus: you could choose to search in all newspaper texts, all of the fiction, all of the factual prose, single texts, or any combination of these. (Cf. also ENPC.)

The corpus project, which includes gathering of texts, grammatical tagging, source annotation, IMS coding, and development of the web interface, has been led by Janne Bondi Johannessen. Diana Santos developed the original web interface for regular expressions (for The Oslo Corpus of Bosnian Texts), while Sigurd Schiøth and Anders Nøklestad extended the interface so as to support searching by clicking in check boxes. Tore Bjertnes Pedersen and Anders Nøklestad have created the codes for source annotation, based on similar work at Seksjon for leksikografi og målføregransking (Section for lexicography and dialect research). The grammatical tagging was mainly done by Kristin Hagen (the morphological part) and Anders Nøklestad (the syntactic part) (but click here to find a complete list of persons involved). Certain parts of the tagger (viz. the multi-tagger) have been developed in collaboration with Dokumentasjonsprosjektet (the Documentation Project) (led by Christian-Emil Ore), and the programming has been performed by Lars-Jørgen Tvedt and partly by Helge Hauglin.

Survey of grammatical tags

A lot of work has gone into the grammatical tagging of the corpus. The development of the tagger itself has involved six person-years of work, mainly financed by the Norwegian Research Council, the Documentation Project and the Text Laboratory. We have used software developed by Lingsoft, Finland, which runs with a kind of dependency grammar called Constraint Grammar. It is possible to search the corpus for specific tags.

Morphological tags

The morphological tags are, strictly speaking, morphosyntactic tags. They indicate part of speech along with all common categories and their features, such as gender (masculine, feminine, neuter), number (singular, plural), definiteness (definite, indefinite) and tense (present tense, past tense), to mention just a few. A full survey is given here. As far as possible, we have followed Norsk Referansegrammatikk (Norwegian Reference Grammar) in our choice of parts of speech and morphosyntactic features. This has led to some untraditional classifications; for instance, all words that were earlier called locative adverbs are now classified as prepositions.

Syntactic tags

The syntactic tags indicate common syntactic functions like Subject and Object. All syntactic tags are preceded by a Commercial At (@). Since we are using a kind of dependency grammar, where every word is labelled either as a head or as a modifier, there are also quite a few less traditional tags, e.g.: @<SBU (SUBORDINATING CONJUNCTION modifying something to the left), @DET> (DETERMINER modifying something to the right), @KON (COORDINATING CONJUNCTION). An arrow on the syntactic tag means that the word is a modifier of a head which is found in the direction of the arrow. A full survey of the syntactic tags is given here.
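The arrow convention can be illustrated with a small Python sketch. The function name and the returned field names are hypothetical and are not part of the tagger or the search interface; the sketch only reads the direction off the three example tags given above.

```python
def describe_syntactic_tag(tag: str) -> dict:
    """Split a syntactic tag such as '@DET>' into its label and arrow direction.

    Hypothetical helper for illustration only; the field names are assumptions.
    """
    if not tag.startswith("@"):
        raise ValueError(f"syntactic tags are preceded by '@': {tag!r}")
    body = tag[1:]
    if body.startswith("<"):
        # Arrow points left: the word modifies a head to its left.
        return {"label": body[1:], "head_direction": "left"}
    if body.endswith(">"):
        # Arrow points right: the word modifies a head to its right.
        return {"label": body[:-1], "head_direction": "right"}
    # No arrow on the tag, so no direction can be read from it.
    return {"label": body, "head_direction": None}


for example in ("@<SBU", "@DET>", "@KON"):
    print(example, describe_syntactic_tag(example))
```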

Survey of source annotations

The codes for source annotation are based on similar work done at Seksjon for leksikografi og målføregransking (Section for lexicography and dialect research), University of Oslo. Here is an example:

Allbjart, Gunnar 'Flukten til livet' flukt.syn SK/AlGu/01

The source annotation is the code at the end of the line. SK means fiction ("skjønnlitteratur"); the codes for the other genres are as follows: AV=newspaper/magazine ("avis/ukeblad") and SA=factual prose ("sakprosa"). The four letters in the middle field indicate the name of the author, or the name and year of the newspaper/magazine, while the last number is a file index in case there is more than one work by the same author or more than one file from the same newspaper/magazine. Note that for the newspapers/magazines there is no relationship between the number of files and the number of issues. For instance, AV/Af94/01 contains 26 issues of Aftenposten from 1994. A full survey of source annotations is given here.
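As a minimal sketch of how such a code can be taken apart, the Python below splits an annotation into its three fields. The class and function names are hypothetical; only the genre codes and the two example annotations come from the description above.

```python
from typing import NamedTuple

# Genre codes as described above.
GENRES = {
    "SK": "fiction",
    "AV": "newspaper/magazine",
    "SA": "factual prose",
}


class SourceAnnotation(NamedTuple):
    genre_code: str
    genre: str
    source: str      # author, or newspaper/magazine plus year
    file_index: int


def parse_source_annotation(code: str) -> SourceAnnotation:
    """Parse a code like 'SK/AlGu/01' (hypothetical helper, not the corpus tooling)."""
    parts = code.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected GENRE/SOURCE/INDEX, got {code!r}")
    genre_code, source, index = parts
    if genre_code not in GENRES:
        raise ValueError(f"unknown genre code {genre_code!r}")
    return SourceAnnotation(genre_code, GENRES[genre_code], source, int(index))


print(parse_source_annotation("SK/AlGu/01"))
print(parse_source_annotation("AV/Af94/01"))
```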

Available search methods

It is possible to search for one, two or three words or parts of words (beginnings or endings), and the words can either be adjacent or separated by a specified number of other words. One or more of the words may be specified, to different degrees, with regard to grammatical category, and you can also specify what kind of text you want to search. It is even possible to search on grammatical category alone, without naming any part of the words.

Note! Remember to clear the search form before each new search.

Examples of the major types of queries

Single words. Find all instances of jente: Write jente in the field Første ord. Click Søk i korpuset.

Prefixes. Find all words beginning with be-: Write be in the field Første ord. Click the checkbox marked Begynnelse av ord. Click Søk i korpuset (examples: bena, bestemt).

Suffixes. Find all words ending in -else: Write else in the field Første ord. Click in the checkbox marked Endelse av ord. Click Søk i korpuset (examples: forbauselse, forskrekkelse).

Word sequences. Find all sequences of adjacent words where the first one ends in -r and the second one begins with be-: Write r in the field Første ord, and click the checkbox marked Endelse av ord, select maks 0 ord mellom, write be in the field Andre ord, and click the checkbox marked Begynnelse av ord. Click Søk i korpuset (examples: eller begynne, har bestemt).

Broken sequence - with intervening words. Find all instances of the word jeg followed by the word og with no more than seven words in between: Write jeg in Første ord, select maks 7 ord mellom, write og in Andre ord. Click Søk i korpuset (example: ...jeg var ute i samme ærend og ble glad...)

Restrict the search to certain kinds of text. Find all words beginning with be- in the fiction material: Write be in Første ord, click the checkbox marked Begynnelse av ord, click on Velg tekster, and select Alle in the "Skjønnlitteratur" menu and click on Ingen below the newspaper and factual prose menus. Click Søk i korpuset (examples: bena, bestemt).

Restrict the search with regard to grammatical category. Find all verbs in the present tense that are not compounds: Do not write anything in the fields Første ord, Andre ord or Tredje ord. Select Verb from the Grammatiske kategorier menu below Første ord, click Morfosyntaktiske trekk and then on the left radio button for Presens in the window that appears. Click OK. Select Annet in the Utelukk kategori(er) menu below Første ord and click on Sammensetning in the window that appears. Click OK and Søk i korpuset (examples: puster, bestemmer, but not pustet, bestemt, massekopierer).
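For readers who prefer to think in CQP terms, the word-based query types above (single word, prefix, suffix, adjacent sequence, broken sequence) can be sketched as CQP-style expressions. The Python helpers below are hypothetical and only build strings. They assume that word forms are stored in a positional attribute named `word`, which is a common Corpus Workbench convention but is not confirmed for this corpus, and they do not cover the grammatical-category or text-selection options, whose attribute names are not given here. The expressions the interface actually generates may differ.

```python
def token(text: str = "", prefix: bool = False, suffix: bool = False) -> str:
    """Build a CQP-style pattern for one token.

    Assumes plain letters in `text` (no regex metacharacters) and an
    attribute called `word`; both are assumptions of this sketch.
    """
    if not text:
        return "[]"              # any token
    if prefix:
        pattern = text + ".*"    # "Begynnelse av ord"
    elif suffix:
        pattern = ".*" + text    # "Endelse av ord"
    else:
        pattern = text
    return f'[word="{pattern}"]'


def sequence(first: str, second: str, max_gap: int = 0) -> str:
    """Join two token patterns with at most `max_gap` tokens in between."""
    if max_gap == 0:
        return f"{first} {second}"
    return f"{first} []{{0,{max_gap}}} {second}"


print(token("jente"))                                              # single word
print(token("be", prefix=True))                                    # prefix
print(token("else", suffix=True))                                  # suffix
print(sequence(token("r", suffix=True), token("be", prefix=True))) # adjacent words
print(sequence(token("jeg"), token("og"), max_gap=7))              # broken sequence
```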

Examples of combinations of the search criteria above

Find all words beginning with be- that are verbs in the fiction material: Write be in Første ord, click the checkbox marked Begynnelse av ord, select Verb in the Grammatiske kategorier menu, click on Velg tekster, select Alle in the "Skjønnlitteratur" menu and click on Ingen below the newspaper and factual prose menus. Click OK and Søk i korpuset (examples: bestemt, begynner, but not bena, begynnelse).

Find all words beginning with be- that are verbs in the present tense, in fiction and factual prose: Write be in Første ord, click the checkbox marked Begynnelse av ord, select Verb in the Grammatiske kategorier menu, click on Morfosyntaktiske trekk and then on the left radio button for Presens in the window that appears, click Velg tekster, select Alle in the "Skjønnlitteratur" menu and Alle in the "Sakprosa" menu, and click on Ingen below the newspaper menu. Click OK and Søk i korpuset (examples: bestemmer, begynner, but not bena, begynnelse, bestemt).

Find all words beginning with be- that are verbs in the present tense in Aftenposten: Write be in Første ord, click in the checkbox marked Begynnelse av ord, select Verb in the Grammatiske kategorier menu, click on Morfosyntaktiske trekk and then on the left radio button for Presens in the window that appears, click Velg tekster, select Aftenposten in the "Aviser" menu and click on Ingen below the factual prose and fiction menus. Click OK and Søk i korpuset (examples: bestemmer, begynner, but not bena, begynnelse, bestemt).

Find all verbs in Aftenposten that are not in the past tense: Do not write anything in the fields Første ord, Andre ord or Tredje ord. Select Verb in the Grammatiske kategorier menu, click on Morfosyntaktiske trekk and then on the right radio button for Presens in the window that appears, click Velg tekster, select Aftenposten in the "Aviser" menu and click on Ingen below the factual prose and fiction menus. Click OK and Søk i korpuset (examples: pustet, bestemmer).

Find all verbs that are followed by a preposition in the fiction material: Do not write anything in the fields Første ord, Andre ord or Tredje ord. Select Verb in the Grammatiske kategorier menu below Første ord and Preposisjon in the corresponding menu below Andre ord, click on Velg tekster, select Alle in the "Skjønnlitteratur" menu and click on Ingen below the newspaper and factual prose menus. Click OK and Søk i korpuset (examples: pustet ut, bestemmer for).

Login

The corpus is freely available for research using login with Feide or eduGAIN. (Contact the Text Laboratory if you need another login alternative.)

Technical information

The IMS Corpus Workbench

This is a front-end to CQP, the Corpus Query Processor of the IMS Corpus Workbench developed by Oliver Christ and Bruno Maximilian Schulze at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. Its Frequently Asked Questions list is available at http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.

We gratefully acknowledge permission to use CQP for research purposes.

Those acquainted with the CQP query syntax can use (almost) all of its power. Particular restrictions are described below.

Corpus structure and encoding

The corpus is encoded in the ISO-8859-1 character set. It is also possible to have the search results shown in pure ASCII format.

The corpus consists of the electronic Norwegian material that was available at the Text Laboratory by January 1999. We have received most of this material in electronic form, either directly from newspapers, authors or publishers, or by way of text collectors such as Humanistisk datasenter in Bergen (now: HIT-senteret) and ECI (European Corpus Initiative). We have also downloaded governmental information bulletins (NOU reports) from the Internet. We are very grateful to the newspapers, publishers and authors for permission to use their texts in this first Oslo corpus. We have not made any changes to the texts, except for deleting certain numerical tables in some of the texts. We have not removed headlines, captions and other elements which might have been thought to create problems for a tagger. Instead, we designed the tagger to be able to handle such elements, albeit within limits.

The corpus has been tagged with a multitagger (developed by the Text Laboratory and the Documentation Project in collaboration), and then with our disambiguating tagger, developed by the Text Laboratory (using software by Lingsoft, Finland). The corpus has been automatically converted to CQP format from pure text files with meta information in the header and from an index containing the correct text identifier.

The corpus is not proofread.

Finally, there are a few differences between our corpus and the Corpus Workbench standard:

The structure of the corpus does not permit formal units like paragraphs and sentences to be included in the query.
Each word in the corpus has its own source annotation. We have arranged for the source to be shown for each line in the concordance.
Capital and non-capital letters have different encodings.
Punctuation marks have been encoded as separate characters, so that it is possible to search for e.g. commas.
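The last two points have practical consequences for searching, which the following Python sketch tries to illustrate. The tokenizer here is a deliberately simple stand-in (a regular expression that separates runs of word characters from single punctuation marks); it is an assumption of this example and not the tokenization actually used for the corpus.

```python
import re
from collections import Counter

# Stand-in tokenizer: word-character runs, or one punctuation mark at a time.
# Hyphenated words and abbreviations would need extra rules; this is only a sketch.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list:
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Da kom hun, og da gikk hun.")
counts = Counter(tokens)

print(tokens)
# Capital and non-capital letters are kept apart, so "Da" and "da" are counted separately.
print(counts["Da"], counts["da"])
# Punctuation marks are tokens of their own, so a comma can be searched for.
print(counts[","])
```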

Information on the search interface

The current search interface makes it possible to

search by clicking and writing
have the results shown in Latin 1 or in pure ASCII
select the amount of context that is to be shown in the concordance
have only a specified number of randomly chosen hits shown
select the kind of search result (concordance, distribution of forms or sources, or a combination)
select concordance without tags, with tags only on the search item, or with tags both on the search item and on the context
sort the concordance by source, search string, or the preceding or following word or punctuation mark.

The search result is shown together with a regular expression form of the query, the date, and the number of hits.

In some cases a warning or a helpful message is given. For example:

Do not ask for a distribution of forms when the search expression only corresponds to a single form
Do not use * instead of .* (a* means a number of a's, not a followed by something else; to get that you would have to write a.*)
Do not use spaces in the middle of a search expression. If you want two words, you have to enclose them in quotation marks.
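The difference between `*` and `.*` can be tried out with any regular expression engine. The sketch below uses Python's `re` module as a stand-in for the regular expressions accepted by the search interface, and `re.fullmatch` so that each pattern has to cover the whole word; the word list is made up for the example.

```python
import re

words = ["a", "aaa", "avis", "bok"]

# "a*" means "any number of a's", so only words made up entirely of a's match.
only_as = [w for w in words if re.fullmatch(r"a*", w)]

# "a.*" means "an a followed by anything", i.e. all words beginning with a.
starts_with_a = [w for w in words if re.fullmatch(r"a.*", w)]

print(only_as)
print(starts_with_a)
```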

Important restrictions

In order to prevent users from downloading entire texts to their own machine, we have implemented the following restrictions:

You are not allowed to request a context larger than 500 characters. No matter how large the number entered, the maximum context you'll see will be 500 characters long.
You are not allowed to get sequences longer than 200 words (from the beginning of the search expression to the end). Longer expressions will be reduced to 200 words.
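The effect of the two limits can be sketched in Python as follows. The function names are hypothetical and this is not the interface's own code; in particular, the description above does not say how the 500 characters are split between left and right context, and the sketch simply treats the limit as a single number.

```python
MAX_CONTEXT_CHARS = 500    # limit stated above
MAX_SEQUENCE_WORDS = 200   # limit stated above


def clamp_context(requested_chars: int) -> int:
    """Return the context size that would actually be shown."""
    return max(0, min(requested_chars, MAX_CONTEXT_CHARS))


def clamp_sequence(words: list) -> list:
    """Reduce a matched sequence to at most 200 words, counted from its start."""
    return words[:MAX_SEQUENCE_WORDS]


print(clamp_context(10_000))
print(clamp_context(120))
print(len(clamp_sequence(["ord"] * 350)))
```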

Publications

Publications where the corpus has been used

Helle Asmussen. 2000. Korpus 2000 - En undersøgelse af brugergrupper og korpusværktøjer. Prosjektoppgave, Institut for Datalingvistik, Handelshøjskolen i København. (HTML, Postscript)
Philipp Conzett. 2002. Frå einskap til ulikskap? Ei gransking av genustilhøvet ved avleiingar på -skap i skandinavisk. Term paper, University of Tromsø.
Hanne Ragnhild Eliassen. 2002. Frekvens og norske verb. Hvordan kan verb klassifiseres, og hvordan påvirker frekvens verbene? Cand.philol. thesis, University of Oslo.
Elisabet Engdahl. 1999. Valet av passivform i modern svenska. Lecture given at Svenskans beskrivning 24 in Linköping.
Elisabet Engdahl. 1999. The choice between bli-passive and s-passive in Danish, Norwegian and Swedish. NORDSEM-report no. 3. (Postscript)
Martin Hilpert. 2002. Semantik und Syntax von Verben der Meinungsäusserung im Dänischen, Norwegischen und Schwedischen. Eine kompararative, korpusbasierte Fallstudie. Universität Hamburg.
Janne Bondi Johannessen. 1998. Negasjonen ikke: Kategori og syntaktisk posisjon. MONS 7. Utvalde artiklar frå det 7. Møtet om Norsk Språk i Trondheim 1997. ISBN 82-7099-307-7
Fredrik Andersen Kavli. 2001. Korpusargumenter. Cand.philol. thesis, University of Bergen. (HTML)
Arild Lian, Paul J. Karlsen, and Bendik Winswold. 2001. A re-evaluation of the phonological similarity effect in adults' short-term-memory of words and nonwords. Memory, 9 (4,5,6), 281-299.
Arne Martinus Lindstad. 1999. Issues in the Syntax of Negation and Polarity in Norwegian. A Minimalist Analysis. Cand.philol. thesis, University of Oslo.
Victoria Rosén. 2000. Er norsk et naturlig språk? In: Øivin Andersen, Kjersti Fløttum and Torodd Kinn (eds.), Menneske, språk og fellesskap. Festskrift til Kirsti Koch Christensen på 60-årsdagen, 1. desember 2000, Oslo, Novus forlag.
Grete Seland. 2001. The Norwegian Reflexive Caused Motion Construction. A Construction Grammar Approach. Cand.philol. thesis, University of Oslo.
Henrik Stiansen. 2001. Indirekte objekt i norsk. Cand.philol. thesis, University of Oslo.
Ingebjørg Tonne. 2001. Progressives in Norwegian and the Theory of Aspectuality. Dr.art thesis, University of Oslo, Acta Humaniora, Unipub/Gnist-Akademika. (Postscript)
Øystein Alexander Vangsnes. 2001. Distributiv possessiv - en binominal konstruksjon. In Inger Moen (et al.), Mons 9: Utvalgte artikler fra Det niende møtet om norsk språk i Oslo 2001, 230-243. Oslo: Novus.

If you use the corpus for lectures or written work, please tell us about it. We would like to extend the list of such work, since it is valuable for all of us.

About tagging

Scientific journals and anthologies:

Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad. 2000. A Web-Based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts. In Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis and G. Stainhaouer (eds.): Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece 31 May - 2 June 2000.
Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad. 2000. A Constraint-Based Tagger for Norwegian. In Lindberg, C.-E. and S. Nordahl Lund (eds.): 17th Scandinavian Conference of Linguistics, vol. I. Odense: Odense Working Papers in Language and Communication, No. 19, vol I.
Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad. 2000. The shortcomings of a tagger. In Proceedings from the 12th "Nordiske datalingvistikkdager", Trondheim 9-10 December, 1999. Trondheim: Lingvistisk institutt, NTNU.
Janne Bondi Johannessen. 1998. Tagging and the case of pronouns. Computers and the Humanities. ISSN 0010-4817
Janne Bondi Johannessen. 1998. Elektroniske hjelpemidler - leksikografisk fornying. Norskrift. ISSN 0800-7764
Kristin Hagen and Janne Bondi Johannessen. 1998. Disambiguering uten syntaks. MONS 7. Utvalde artiklar frå det 7. Møtet om Norsk Språk i Trondheim 1997. ISBN 82-7099-307-7
Anders Nøklestad. 1998. Statistisk disambiguerende tagging av norsk. MONS 7. Utvalde artiklar frå det 7. Møtet om Norsk Språk i Trondheim 1997. ISBN 82-7099-307-7
Janne Bondi Johannessen and Helge Hauglin. 1998. An Automatic Analysis of Norwegian Compounds. Papers from the 16th Scandinavian Conference of Linguistics, Turku/Åbo, Finland. ISBN 951-29-1327-5

Unpublished:

Kristin Hagen, Janne Bondi Johannessen and Kristian Emil Kristoffersen. 1997. Problemer ved bruk av andres lister til taggerformål. Talk presented at Møte om norsk språk 7, University of Trondheim.

Version

This is version 2 of the corpus, tagged using version 2 of the multitagger and version 2 of the disambiguating tagger.

Planned improvements

We are planning to make some improvements, hopefully in the near future.

Collocations. We will offer collocations for the search word.
Frequency lists. We will create frequency lists for all of the text types.
Random selection with even distribution among text types. We will offer the opportunity to search for a certain number of randomly selected instances where the instances are evenly distributed among the various text types.
Remove articles etc. in the wrong language variety. We will continue to remove extensive nynorsk texts from the bokmål material and vice versa.
The layout on the click-and-write pages will be continuously evaluated and improved.
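The planned random selection with even distribution among text types amounts to what is often called stratified sampling. The Python sketch below shows one possible way to do it; it is an illustration under stated assumptions, not a description of how the feature will be implemented. The function name, the representation of hits as (text type, concordance line) pairs, and the made-up hits in the usage lines are all hypothetical.

```python
import random
from collections import defaultdict


def sample_evenly(hits, per_type, seed=None):
    """Draw up to `per_type` random hits from each text type.

    `hits` is an iterable of (text_type, concordance_line) pairs.
    A text type with fewer hits than requested contributes all of its hits.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for text_type, line in hits:
        by_type[text_type].append(line)
    return {
        text_type: rng.sample(lines, min(per_type, len(lines)))
        for text_type, lines in by_type.items()
    }


# Made-up hits: ten from fiction (SK), three from newspapers (AV).
hits = [("SK", f"fiction hit {i}") for i in range(10)]
hits += [("AV", f"newspaper hit {i}") for i in range(3)]

selection = sample_evenly(hits, per_type=5, seed=1)
print({text_type: len(lines) for text_type, lines in selection.items()})
```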

We want to continue to improve the Oslo Corpus. Therefore, we will appreciate all suggestions for improvements, either to tekstlab-post@iln.uio.no or to the corpus discussion list, oktnt-list@iln.uio.no. We would like to thank Stig Johansson, Elisabet Engdahl, Johan Laurits Tønnesson, and Carl Vikner for their valuable suggestions.

Contact us.

Norwegian document created by Janne Bondi Johannessen, translated into English by Anders Nøklestad.
Last updated 7 May 2007 by AN.
