Some research on the Oslo Corpus of Bosnian Texts


During my stay in Oslo I investigated recent changes in the Bosnian language as documented in the Oslo Corpus of Bosnian Texts. The corpus was compiled at the University of Oslo as a joint project between the Department for East European and Oriental Studies and the Text Laboratory. It contains approximately 1.5 million words and comprises several genres: fiction (novels and short stories), essays, children’s stories, folklore, Islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, were for the most part published in the 1990s. The corpus provides a new and different basis for research into the language of Bosnia and Herzegovina.

Nedzad Leko

I have investigated lexical doublets, negative polarity items and other grammatical phenomena. I have also compiled a frequency list containing the 1000 most frequent word forms in the Oslo Corpus. Let me say something more about the latter.

The Oslo Corpus of Bosnian Texts enables linguists to compile a frequency list of word forms. Such frequency lists exist for many languages and are a useful tool for scientific and pedagogical purposes. However, although it is simple to compile a frequency list of word forms, it is extremely difficult and time-consuming to determine the exact number of lexemes that happen to share the same morphological form. Establishing the exact number of occurrences of a single lexeme is also problematic in highly inflected languages with rich morphology, like Bosnian: all the different morphological realisations of that lexeme, depending on gender, number, person, case, tense, or aspect, have to be taken into account.
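The difference between counting word forms and counting lexemes can be illustrated with a minimal Python sketch. It is not the procedure used for the Oslo Corpus: the sample text is invented, and the names `SAMPLE`, `LEMMA_OF`, `count_word_forms` and `count_lexemes` are hypothetical. The form-to-lexeme lookup is written by hand here, and building such a lookup for a real corpus is exactly the laborious step.

```python
import re
from collections import Counter

# Invented sample text, NOT taken from the Oslo Corpus.
SAMPLE = "Kuća je velika. U kući je toplo. Vidim kuću."

# Hypothetical, hand-made lookup from inflected form to lexeme (lemma).
LEMMA_OF = {"kuća": "kuća", "kući": "kuća", "kuću": "kuća"}


def count_word_forms(text):
    """Count each distinct word form (a lower-cased run of letters)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def count_lexemes(form_counts, lemma_of):
    """Add up the counts of all forms that map to the same lexeme.

    Forms missing from the lookup are counted under their own spelling.
    """
    lexeme_counts = Counter()
    for form, n in form_counts.items():
        lexeme_counts[lemma_of.get(form, form)] += n
    return lexeme_counts


form_counts = count_word_forms(SAMPLE)
lexeme_counts = count_lexemes(form_counts, LEMMA_OF)

print(form_counts.most_common(3))
print(lexeme_counts.most_common(3))
```

In this toy text the three case forms kuća, kući and kuću each occur once as word forms, but together they account for three occurrences of a single lexeme.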

Obviously, homonymy/homography represents a serious problem for compiling a frequency dictionary, especially when the form in question has a large number of occurrences, so that examining every single example in the corpus becomes too demanding a task.
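A small Python sketch can show why homography matters for the counts. It rests on assumptions: the sentence is invented, the lexicon `READINGS` is a hand-made toy covering only two forms, and `ambiguous_forms` is a hypothetical helper. The form sam is used because it can be a form of the verb biti ('I am') or the adjective sam ('alone').

```python
import re
from collections import Counter

# Invented sentence ("I am alone"), NOT taken from the Oslo Corpus.
SAMPLE = "Ja sam sam."

# Hypothetical toy lexicon: word form -> lexemes it may belong to.
READINGS = {
    "ja": ["ja"],
    "sam": ["biti", "sam"],
}


def ambiguous_forms(text, readings):
    """Return the forms that have more than one possible lexeme.

    Each flagged form maps to (number of occurrences, possible lexemes).
    Forms that are not in the lexicon are skipped.
    """
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    flagged = {}
    for form, n in counts.items():
        lexemes = readings.get(form, [])
        if len(lexemes) > 1:
            flagged[form] = (n, sorted(lexemes))
    return flagged


flagged = ambiguous_forms(SAMPLE, READINGS)

# Every occurrence of a flagged form must be checked in context by hand.
tokens_to_inspect = sum(n for n, _ in flagged.values())

print(flagged)
print(tokens_to_inspect)
```

Here both occurrences of sam are flagged, and only the context shows that the first is the verb and the second the adjective. For a frequent form in a corpus of 1.5 million words, the same manual check is multiplied accordingly.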

The list of the 1000 most frequent word forms found in the Oslo Corpus of Bosnian Texts was compiled and posted on the Internet as part of the web pages devoted to the corpus. My papers based on the corpus are available from the Text Laboratory.

Nedzad Leko



18 December 1998, AN, <anders.noklestad@ilf.uio.no>