The Oslo Corpus of Tagged Norwegian Texts
(bokmål and nynorsk parts)

The bokmål part of the Oslo Corpus contains about 18.5 million words, while the nynorsk part contains about 3.8 million words. The corpora have been coded according to the IMS Corpus Workbench standard (Institut für Maschinelle Sprachverarbeitung, University of Stuttgart). The search interface has been developed at the Text Laboratory.

  1. Contents of the corpus
  2. Available search methods
  3. How to get permission to use the corpus
  4. Technical information
  5. Frequency lists
  6. Publications
  7. Version
  8. Planned improvements
  9. Users of the corpus
  10. How to contact us

[Search the bokmål corpus] [Search the nynorsk corpus] [Text Laboratory home page]


Contents of the corpus

The corpus consists of the texts that were available at the Text Laboratory in January 1999. It is composed of texts from three genres: fiction (bokmål: 1.7 million words; nynorsk: 2.1 million), newspapers/magazines (bokmål: 9.6 million; nynorsk: 1 million), and factual prose (bokmål: 7.1 million; nynorsk: 700,000). All fiction comes from ECI (European Corpus Initiative) and Norsk Tekstarkiv (Norwegian Text Archive), Bergen (now: HIT-senteret). The texts from newspapers and magazines have been collected by the Text Laboratory with kind permission from the various editorial offices. The factual prose consists mainly of NOU reports (Norwegian Official Reports) and Norwegian laws and regulations. A detailed survey of the texts, with source annotation codes, is given here.

The corpus is not meant to be representative in any sense, although it contains texts from a variety of genres. The main purpose of the corpus is to offer a large amount of text which researchers can use for searching. However, since it is possible to restrict the search to specific sources, the corpus can be used as a tailored corpus: you could choose to search in all newspaper texts, all of the fiction, all of the factual prose, single texts, or any combination of these. (Cf. also ENPC.)

The corpus project, which includes gathering of texts, grammatical tagging, source annotation, IMS coding, and development of the web interface, has been led by Janne Bondi Johannessen. Diana Santos developed the original web interface for regular expressions (for The Oslo Corpus of Bosnian Texts), while Sigurd Schiøth and Anders Nøklestad extended the interface so as to support searching by clicking in check boxes. Tore Bjertnes Pedersen and Anders Nøklestad have created the codes for source annotation, based on similar work at Seksjon for leksikografi og målføregransking (Section for lexicography and dialect research). The grammatical tagging was mainly done by Kristin Hagen (the morphological part) and Anders Nøklestad (the syntactic part); click here for a complete list of persons involved. Certain parts of the tagger (viz. the multi-tagger) have been developed in collaboration with Dokumentasjonsprosjektet (the Documentation Project, led by Christian-Emil Ore), and the programming has been performed by Lars-Jørgen Tvedt and partly by Helge Hauglin.


Survey of grammatical tags

A lot of work has gone into the grammatical tagging of the corpus. The development of the tagger itself has involved six person-years of work, mainly financed by the Norwegian Research Council, the Documentation Project and the Text Laboratory. We have used software developed by Lingsoft, Finland, which runs with a kind of dependency grammar called Constraint Grammar. It is possible to search the corpus for specific tags.

Morphological tags

The morphological tags are, strictly speaking, morphosyntactic tags. They indicate part of speech along with all common categories and their features, such as gender (masculine, feminine, neuter), number (singular, plural), definiteness (definite, indefinite), and tense (present tense, past tense), to mention just a few. A full survey is given here. As far as possible, we have followed Norsk Referansegrammatikk (Norwegian Reference Grammar) in our choice of parts of speech and morphosyntactic features. This has led to some untraditional classifications; for instance, all words that were earlier called locative adverbs are now classified as prepositions.

Syntactic tags

The syntactic tags indicate common syntactic functions like Subject and Object. All syntactic tags are preceded by a commercial at sign (@). Since we are using a kind of dependency grammar, where every word is labelled either as a head or as a modifier, there are also quite a few less traditional tags, e.g.: @<SBU (SUBORDINATING CONJUNCTION modifying something to the left), @DET> (DETERMINER modifying something to the right), @KON (COORDINATING CONJUNCTION). An arrow (< or >) on the syntactic tag means that the word is a modifier of a head which is found in the direction of the arrow. A full survey of the syntactic tags is given here.
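The arrow convention can be illustrated with a minimal Python sketch. It is not part of the tagger or the search interface; the function name `parse_syntactic_tag` is hypothetical, and the sketch assumes that an arrow appears only directly after the @ (pointing left) or at the very end of the tag (pointing right), as in the three examples above.

```python
def parse_syntactic_tag(tag):
    """Split a syntactic tag into (label, direction of the head).

    Assumption: '<' directly after '@' means the head is to the left,
    '>' at the end means the head is to the right, and a tag without
    an arrow carries no direction.
    """
    if not tag.startswith("@"):
        raise ValueError(f"syntactic tags start with '@': {tag!r}")
    body = tag[1:]
    if body.startswith("<"):
        return body[1:], "left"
    if body.endswith(">"):
        return body[:-1], "right"
    return body, None


for tag in ("@<SBU", "@DET>", "@KON"):
    label, direction = parse_syntactic_tag(tag)
    print(tag, label, direction)
```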

Survey of source annotations

The codes for source annotation are based on similar work done at Seksjon for leksikografi og målføregransking (Section for lexicography and dialect research), University of Oslo. Here is an example:

Allbjart, Gunnar 'Flukten til livet' flukt.syn SK/AlGu/01

The source annotation is the code at the end of the line. SK means fiction ("skjønnlitteratur"); the codes for the other genres are as follows: AV=newspaper/magazine ("avis/ukeblad") and SA=factual prose ("sakprosa"). The four letters in the middle field indicate the name of the author, or the name and year of the newspaper/magazine, while the last number is a file index in case there is more than one work by the same author or more than one file from the same newspaper/magazine. Note that for the newspapers/magazines there is no relationship between the number of files and the number of issues. For instance, AV/Af94/01 contains 26 issues of Aftenposten from 1994. A full survey of source annotations is given here.
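As an illustration of the code layout, the following Python sketch splits a source annotation into its three fields. The names `SourceCode` and `parse_source_code` are hypothetical, and the sketch assumes that every code has exactly three fields separated by slashes, as in the two examples above.

```python
from typing import NamedTuple

# Genre codes as described in the text above.
GENRES = {
    "SK": "fiction",
    "AV": "newspaper/magazine",
    "SA": "factual prose",
}


class SourceCode(NamedTuple):
    genre: str
    source: str      # author, or newspaper/magazine name and year
    file_index: int


def parse_source_code(code):
    """Parse a source annotation such as 'SK/AlGu/01'."""
    genre_code, source, index = code.split("/")
    if genre_code not in GENRES:
        raise ValueError(f"unknown genre code: {genre_code!r}")
    return SourceCode(GENRES[genre_code], source, int(index))


print(parse_source_code("SK/AlGu/01"))
print(parse_source_code("AV/Af94/01"))
```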


Available search methods

It is possible to search for one, two or three words or parts of words (beginnings or endings), and the words can either be adjacent or separated by a specified number of other words. One or more of the words may be specified, to different degrees, with regard to grammatical category, and you can also specify what kind of text you want to search. It is even possible to search on grammatical category alone, without naming any part of the words.
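For readers who think in terms of CQP queries, a search of this kind could be written roughly as below. This is a sketch under stated assumptions, not the code behind the search form: the helper names `token` and `sequence` are hypothetical, only the standard `word` attribute is used, and the attribute names for grammatical tags in this corpus are not shown because they are not given here.

```python
def token(text, part="whole"):
    """Return a CQP token expression for a word, a beginning, or an ending."""
    if not text.isalpha():
        # Keep the sketch simple: no escaping of regular expression characters.
        raise ValueError("this sketch only handles plain letters")
    if part == "beginning":
        regex = text + ".*"
    elif part == "ending":
        regex = ".*" + text
    elif part == "whole":
        regex = text
    else:
        raise ValueError(f"unknown part: {part!r}")
    return f'[word="{regex}"]'


def sequence(tokens, gap=0):
    """Join token expressions, with exactly `gap` arbitrary words in between."""
    separator = f" []{{{gap}}} " if gap else " "
    return separator.join(tokens)


# A word beginning with "hus", then two arbitrary words, then "er".
query = sequence([token("hus", "beginning"), token("er")], gap=2)
print(query)  # [word="hus.*"] []{2} [word="er"]
```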

Note! Remember to clear the search form before each new search.

Examples of the major types of queries

Examples of combinations of the search criteria above


Technical information

The IMS Corpus Workbench

This is a front-end to CQP, the Corpus Query Processor of the IMS Corpus Workbench developed by Oliver Christ and Bruno Maximilian Schulze at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. Its Frequently Asked Questions list is available at http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.

We gratefully acknowledge permission to use CQP for research purposes.

Those acquainted with the CQP query syntax can use (almost) all of its power. Particular restrictions are described below.


Corpus structure and encoding

The corpus is encoded in the ISO-8859-1 character set. It is also possible to have the search results shown in pure ASCII format.
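The difference between the two display formats can be sketched in Python. How the interface actually produces its ASCII output is not described here, so the transliteration table below (æ to ae, ø to oe, å to aa) is purely an assumption for illustration, as is the function name `to_ascii`.

```python
# In ISO-8859-1 every character is a single byte; "å" is byte 0xE5.
raw = "bokmål".encode("iso-8859-1")
text = raw.decode("iso-8859-1")

# Assumed transliteration of the Norwegian letters outside ASCII.
ASCII_MAP = str.maketrans({
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "Ae", "Ø": "Oe", "Å": "Aa",
})


def to_ascii(text):
    """Transliterate Norwegian letters; replace anything else non-ASCII with '?'."""
    return text.translate(ASCII_MAP).encode("ascii", "replace").decode("ascii")


print(raw)
print(to_ascii(text))
```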

The corpus consists of the electronic Norwegian material that was available at the Text Laboratory by January 1999. We have received most of this material in electronic form, either directly from newspapers, authors or publishers, or by way of text collectors such as Humanistisk datasenter in Bergen (now: HIT-senteret) and ECI (European Corpus Initiative). We have also downloaded governmental information bulletins (NOU reports) from the Internet. We are very grateful to the newspapers, publishers and authors who have given us permission to use their texts in this first Oslo corpus. We have not made any changes to the texts, except for deleting certain numerical tables in some of them. We have not removed headlines, captions and other elements which might have been thought to create problems for a tagger. Instead, we designed the tagger to be able to handle such elements, albeit within limits.

The corpus has been tagged with a multitagger (developed by the Text Laboratory and the Documentation Project in collaboration), and then with our disambiguating tagger, developed by the Text Laboratory (using software by Lingsoft, Finland). The corpus has been automatically converted to CQP format from pure text files with meta information in the header and from an index containing the correct text identifier.

The corpus is not proofread.

Finally, there are a few differences between our corpus and the Corpus Workbench standard:


Information on the search interface

The current search interface makes it possible to

The search result is shown together with a regular expression form of the query, the date, and the number of hits.

In some cases a warning or a helpful message is given. For example:

Important restrictions

In order to prevent users from downloading entire texts to their own machine, we have implemented the following restrictions:


Publications

Publications where the corpus has been used

If you use the corpus for lectures or written work, please tell us about it. We would like to extend the list of such work, since it is valuable for all of us.

About tagging

Scientific journals and anthologies:

Unpublished:


Version

This is version 2 of the corpus, tagged using version 2 of the multitagger and version 2 of the disambiguating tagger.


Planned improvements

We are planning to make some improvements, hopefully in the near future.

We want to continue to improve the Oslo Corpus. We therefore appreciate all suggestions for improvements, sent either to tekstlab-post@iln.uio.no or to the corpus discussion list, oktnt-list@iln.uio.no. We would like to thank Stig Johansson, Elisabet Engdahl, Johan Laurits Tønnesson, and Carl Vikner for their valuable suggestions.


Contact us.

Norwegian document created by Janne Bondi Johannessen, translated into English by Anders Nøklestad.
Last updated 7 May 2007 by AN.