The Oslo Corpus of Bosnian Texts

[Korpus bosanskih tekstova na Univerzitetu u Oslu]

The Oslo Corpus of Bosnian Texts contains approximately 1.5 million words. It is encoded with the IMS Corpus Workbench, developed at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart, to which the Text Laboratory has added a suitable interface.

  1. Contents of the corpus
  2. Types of queries available
  3. How to get permission to use the corpus
  4. How to get and produce the right fonts
  5. Technical information
  6. The most frequent 1,000 wordforms
  7. Available publications on the corpus
  8. Version
  9. Acknowledgement
  10. Users of the corpus
  11. How to contact us

Contents of the corpus

This corpus has been compiled at the University of Oslo as a joint project between the Department for East European and Oriental Studies and the Text Laboratory. The corpus contains approximately 1.5 million words and comprises several genres: fiction (novels and short stories), essays, children's stories, folklore, Islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, were for the most part published in the 1990s. The corpus provides a new and different basis for research into the language of Bosnia and Herzegovina.

The project has been supervised by assistant professor Janne Bondi Johannessen, while professor Svein Mønnesland was responsible for the selection and compilation of the texts. Gordana Vranic and Kemila Basic have made the texts available electronically (by scanning and adaptation) as simple text files. Diana Santos has built the corpus from those files in the format required by the corpus tools used (see below for more information), and has also written the Web interface.

The holders of the copyrights for all the texts have kindly granted permission for the use of the texts in this corpus. In the event that a text is taken from a book, it never covers more than three quarters of that book.

For a detailed overview of the contents in terms of size and source, see the "Sadrzzaj" page.

Types of queries available

When querying the corpus, one can ask for a concordance (KWIC, KeyWord In Context, style - the default), or one can ask for the distribution of the results, in terms of forms, or in terms of text source. In addition, one can, in the very same query, ask for both the concordance and (one of) the distributions.
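
As a rough illustration of these two kinds of output, the following Python sketch produces a KWIC-style concordance and a distribution by form over a small token list. It is not the corpus software: the function names `kwic` and `form_distribution`, the `width` parameter, and the sample sentence are all invented for this illustration, and Python's `re` module merely stands in for the regular expressions of the query language.

```python
import re
from collections import Counter

def kwic(tokens, pattern, width=2):
    """Return (left context, match, right context) for each matching token."""
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def form_distribution(tokens, pattern):
    """Count the matching word forms, most frequent first."""
    rx = re.compile(pattern)
    counts = Counter(tok for tok in tokens if rx.fullmatch(tok))
    return counts.most_common()

# Invented sample sentence, not taken from the corpus.
sample = "kuća je velika a kuće su male i kuća je stara".split()

for left, match, right in kwic(sample, "kuć."):
    print(f"{left:>10}  [{match}]  {right}")

print(form_distribution(sample, "kuć."))  # [('kuća', 2), ('kuće', 1)]
```

A distribution by text source would follow the same pattern, counting a source identifier for each match instead of the matched form.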

Although we plan eventually to provide a simpler, fully menu-based query form, for the moment we rely almost completely on the CQP query syntax, which lets one express quite complex choices compactly using regular expressions.

Examples of Bosnian queries are:

It is important to be aware that, in addition to formal properties of the text, one can also make queries with parameters such as text type, author, date, or even a particular work. For an overview of the possibilities offered by our classification of the texts, see the "Sadrzzaj" page. Some examples:

How to see and produce the right fonts

In order to show the results with Bosnian characters, the computer on which you run your browser must support ISO-8859-2. If the results of your search look ugly, you can select the all-ASCII display option in the query form. If you cannot type Bosnian characters directly, you can use their octal codes, their standard "alongations", or the corresponding ISO-8859-1 (Latin 1) character instead. Here are the possibilities:

Bosnian   Octal codes   Alongation   Latin 1
Ch, ch    \306          Ch           Æ
CC, cc    \310          CC           È
Dj, dj    \320          Djj          Ð
Dz, dz    D\256         Dz
Ss, ss    \251          SS           ©
ZZ, zz    \256          ZZ, Zh       ®
          \276          zz, zh       ¾
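
The octal codes above are byte values, and the Latin 1 column shows how the same byte appears when it is read as ISO-8859-1 rather than ISO-8859-2. The following Python sketch makes that correspondence visible using only the standard codecs; the helper name `decode_byte` is invented for this illustration and is not part of the corpus interface.

```python
# Octal byte values listed in the table above.
OCTAL_CODES = [0o306, 0o310, 0o320, 0o251, 0o256, 0o276]

def decode_byte(value, encoding):
    """Interpret a single byte value in the given 8-bit encoding."""
    return bytes([value]).decode(encoding)

for value in OCTAL_CODES:
    latin2 = decode_byte(value, "iso8859_2")  # the Bosnian character
    latin1 = decode_byte(value, "latin_1")    # what a Latin 1 display shows
    print(f"\\{value:o}  {latin2}  {latin1}")
```

This is why typing the Latin 1 character works as a substitute: it sends the same byte that ISO-8859-2 assigns to the Bosnian character.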

Along with some examples:

Please note that
  1. in order to input octal codes, you have to enclose the words in quotes.
  2. in order to make the character encodings unambiguous, we have changed the standard representation of Dj, dj to Djj and djj instead. This does not apply to the display of the results, which follows the standard. In other words, you look for Djje in your query, but you'll see Dje if you selected all-ASCII mode.
  3. even if you input them as sequences of characters, the Bosnian characters are considered to be one character long, except for Dz, dz, which is regarded as D, d followed by ZZ, zz. Given that "." stands for any character in the CQP syntax, this means that e.g. stra.no will match strassno, but .amijskih will not match dzamijskih.
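
The third point can be checked with a small Python sketch. It is only an approximation of the behaviour described above, not the interface's own code: the mapping covers the lowercase input forms only, the names `ALONGATIONS`, `expand` and `matches` are invented here, Python's `re` module stands in for CQP's regular expressions, and the naive replacement would mishandle any word that genuinely contains one of these letter sequences.

```python
import re

# Lowercase input conventions, assumed from the table and notes above.
ALONGATIONS = {
    "djj": "đ",
    "ch": "ć",
    "cc": "č",
    "ss": "š",
    "zz": "ž",
    "zh": "ž",
    "dz": "dž",  # stays two characters: d followed by ž
}

# Try the longest sequences first so that "djj" is seen as one unit.
_PATTERN = re.compile("|".join(sorted(ALONGATIONS, key=len, reverse=True)))

def expand(text):
    """Replace each typed sequence with the character(s) it stands for."""
    return _PATTERN.sub(lambda m: ALONGATIONS[m.group(0)], text)

def matches(query, word):
    """True if the expanded query, read as a regex, matches the whole word."""
    return re.fullmatch(expand(query), expand(word)) is not None

print(matches("stra.no", "strassno"))       # True: "ss" counts as one character
print(matches(".amijskih", "dzamijskih"))   # False: "dz" counts as two
print(matches("..amijskih", "dzamijskih"))  # True
```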

Technical information

The IMS Corpus Workbench

This is a front-end to CQP, the Corpus Query Processor of the IMS Corpus Workbench developed by Oliver Christ and Bruno Maximilian Schulze at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. Its Frequently Asked Questions list is at http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.

We gratefully acknowledge permission to use CQP for research purposes.

Those acquainted with the CQP query syntax can use (almost) all of its power. Particular restrictions are described below.

Corpus structure and encoding

The corpus is encoded in the ISO-8859-2 character set. Instructions on how to configure your browser for some of the most common platforms can be read here.

Since it cannot be expected that every user will have access to a browser which allows the correct display of ISO-8859-2 encoded documents, the all-ASCII display option is available in the query form, which caters for the standard two character display of Bosnian-specific characters, as described above.

The corpus was created by scanning books and other printed material with an optical character recognizer (OCR); in some rare cases, the material obtained was already in electronic format. A few editorial alterations were made:

The corpus was automatically derived in CQP format from Word text-only files with meta-information as header, and from a table of contents including the correct text identifier, which was created as a Word file by Gordana Vranic.

The corpus was not manually revised after the conversion, so it is possible that some problems will appear. Please report any such problems, as well as general problems, suggestions for improvement, etc., to us.

Finally, there are a number of points users should take into consideration when querying the system. These concern the way the corpus is stored inside the Corpus Workbench itself:

Information on the search interface

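The current search interface allows you to request a concordance, a distribution of the results, or both, as described under "Types of queries available" above.

The output is returned with an indication of the query issued by the user, the date, and the number of matches.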

If the number of matches is not zero and a concordance was requested, the number of instances found and the number of instances that will actually be displayed are shown, followed by the instances themselves, with the actual match emphasized. If a distribution was requested, it is output in a simple table format, in decreasing order of frequency.

In some cases, warning or help messages are issued. The latter are meant to give some help to a first-time user. For example,


In order to prevent users from downloading the whole texts onto their own machines, the following restrictions were implemented:

Comparison with using CQP directly

Compared with using CQP on your own machine, the Web interface is slower and lacks some features, most significantly:

The restrictions described above do not hold if you have direct access to CQP and the corpus on your own machine.

However, the display of the source identification, together with each example, is an improvement relative to the CQP and Xkwic programs.

Planned improvements

In the future, we plan to add the following capabilities to the Web interface:

Suggestions for other capabilities, as well as constructive complaints, are always welcome.

Available publications on the corpus

Browne 98
Browne, Wayles. Agreement with infinitive subjects in Slavic; with a note on Corbett's notion of `real distance'. (Paper given at workshop on Comparative Slavic Morphosyntax, Bloomington, Indiana, 5-7 June 1998)
Jakopin 99
Jakopin, Primoz. Upper bound of entropy in Slovenian literary texts (written in Slovenian, with an English abstract). Ph.D. thesis, Ljubljana University.
Leko 98a
Leko, Nedzad. Compiling word frequency lists: problems of homonymy. Ms. University of Sarajevo and University of Oslo.
Leko 98b
Leko, Nedzad. Some lexical doublets in the Oslo Corpus of Bosnian Texts: A comparison with a previous study of doublets. Ms. University of Sarajevo and University of Oslo.
Leko 98c
Leko, Nedzad. Some problems in compiling a frequency dictionary from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
Leko 98d
Leko, Nedzad. Polarity Items in Bosnian. Ms. University of Sarajevo and University of Oslo.
Leko 98e
Leko, Nedzad. Recent changes in the Bosnian language as reflected by and documented from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
Santos 98
Santos, Diana. Providing access to language resources through the World Wide Web: the Oslo Corpus of Bosnian Texts. Proceedings of The First International Conference on Language Resources and Evaluation (Granada, 28-30 May 1998).
Szucsich 2002
Szucsich, Luka. Nominale Adverbiale im Russischen. Syntax, Semantik und Informationsstruktur. Otto Sagner Verlag: München (Munich).
Hellman 2005
Hellman, Matias. Znati and um(j)eti in Serbian, Croatian and Bosnian. Grammaticalisation of Habitual Auxiliaries. Slavica Helsingiensia 25.

We would like to know about any further publications using material from our corpus, and eventually make them available from this page.

Version

This is Version 1.1 of the corpus and Version 2.1 of the interface, released on 20 April 1998.

Acknowledgement

We gratefully acknowledge Helge Hauglin's help in debugging CGI programs, Kjetil Rå Hauge's information on fonts and general feedback from an informed user's perspective, and the people at the University of Stuttgart for general technical support concerning CQP.

Our largest debt goes to Nedzad Leko, who was an enthusiastic first user, and provided us with documentation, feedback, and the frequency lists, as well as with the first papers using our corpus.

How to contact us

In Bosnian, please contact Professor Svein Mønnesland (svein.monnesland@east.uio.no):

Svein Mønnesland
Institute for Central European and Oriental Studies,
University of Oslo,
Postboks 1030
Blindern, N-0315 Oslo

+47-2285 6702

+47-2285 4140

In English, you can contact the Text Laboratory by sending mail to tekstlab-post@iln.uio.no. You can also look at the Text Laboratory's contact page for more detailed information.

