The Nordic Dialect Corpus Search Interface

Example Searches

On this page a few example searches will be presented, along with a quick intro to dealing with search results. The examples are meant to illustrate some basic functions of the Glossa interface. A more thorough walk-through of all of the functions the interface has to offer is found in The full User Manual. See also The Search Interface Documentation.

Contents

Example 1 - Searching for a Specific Word Form in a Specific Language
Example 2 - Searching for Phrases, Lemmas and Classes of Words Defined by Grammatical Criteria
Example 3 - Searching for Compounds Using Parts of Words
Example 4 - Searching for Semi-Phonetic Alternants of Specific Words
A Quick Intro to Dealing with Search Results

Example 1 - Searching for a Specific Word Form in a Specific Language

The first example will show how to search for a specific word form in a specific language. To illustrate, we are going to look at a word that varies a lot across different Norwegian dialects, namely the first person pronoun "jeg".

To start out, we'll need to type the word into the word box in the linguistic search field (top red box, figure 1 below).

There are also a couple of more advanced functions that need to be given attention, and that are worth keeping in mind for all searches.

First of all, the maximum number of hits is limited to 2 000 by default. When searching for very frequent words (like e.g. "jeg") this number might be to small to include all occurrences in the corpus. If it is of interest to include all occurrences, this number should therefore be adjusted up. In this example we'll set the number to 200 000 (middle red box, figure 1 below).

Second, a search in the corpus will by default include dialects from all languages represented, i.e. Danish, Faroese, Icelandic, Norwegian and Swedish. In other words, if the query matches a word in the standard orthography of any one of these languages that is actually found in the corpus, it will be included in the search results. The first person pronoun is written "jeg" not only in Norwegian, but also in Danish (and partially in the Faroese transcriptions). The query "jeg" will therefore give "unwanted" Danish and Faroese hits in addition to the Norwegian hits we are interested in for this example.

For this (and the following) example searches, we are therefore going to limit the search to Norwegian dialects only. This is done by expanding the country table (bottom red box, figure 1), and then double-clicking Norway. Norway will then move from the left column (excluded) to the right column (included). To deselect Norway, double click it again.

Using the Metadata Specification Field, it's possible to limit searches to geographic areas and specific informants. It's also possible to include or exclude informant groups based on age group and sex, the year the recording took place and what genre the recording is (interview or conversation). If you want to check which informants that are going to be included in the search based on your current selections in the Metadata Search Field, you can press the show informants button on the right hand side. A new window will then open showing you a table of the included informants with available metadata. More information on data collection is available here.

When all the variables are entered, your interface will look like the one showed figure 1 below. You are then ready to press the search corpus button. A new window will then appear with the search results. A brief guideline on how to handle search results is found below. This search will only give hits where the transcription includes the specific word form "jeg". In other words, the oblique form "meg" will not be included among the hits. In Example 2, you will see how to search for lemmas and classes of words defined by grammatical criteria.

Figure 1 - Glossa search interface ready for example search 1

Back to top.

Example 2 - Searching for Phrases, Lemmas and Classes of Words Defined by Grammatical Criteria

This example will show how we can search for phrases, lemmas and grammatical categories. We will illustrate this by looking at Norwegian collocations consisting of a verb followed by the negative adverb "ikke". In such collocations we often find phonetic/phonological alternations of either or both words.

In order to add an extra word box to the search, thus making it a search for a multi-word phrase instead of a search for a single word, click the plus-button (+) on the right side of the word box. You are now presented with two such boxes instead of one. The extra word box may be removed again by clicking the minus-button (-).

Now for this example, the first word of the phrase should be a verb. Since the transcriptions are morphologically tagged, it is possible to search for a class of words defined by grammatical criteria. These criteria are found in the expanding menu directly below the word box. For this example, we find the verb criterion under pos ("part of speech"). Click it, and it will show up in a new white box below the word box, cf. figure 2 below. Selected criteria may be removed again by double clicking them.

The tagging process itself has been done automatically, which gives some shortcomings. For example, the corpus unfortunately includes some words that are incorrectly tagged. It's therefore important to remember that searching for specific word forms always will retrieve the hits you expect, while that is not necessarily the case when the search includes classes of words defined by grammatical criteria. For a discussion on this, and an overview of the searchable criteria, cf. the full user manual (coming soon).

Continuing with the example search, we'll simply enter "ikke" in the second word box. This gives us a query consisting of a two word collocation where the first word is defined as belonging to the grammatical category verb and the second word is "ikke".

Now, say we were only interested in looking at alternations in collocations where the verb is in the present tense. This can easily be done by adding another criterion, namely pres ("present") that we find under temp ("tempus"/"tense"). If we now set the max results number to 200 000 and narrow the search only to include Norwegian dialects like we did in example 1 above, we are ready to press the search corpus button. A new window will appear with the search results (more about handling results below).

Figure 2 - Selecting criteria: pos ("part of speech") - verb

Let's say that through the search above, we find particularly interesting alternations in collocations containing different forms of one specific verb, e.g. "skulle" ("should"). We therefore want to study all collocations with all forms of this particular verb more thoroughly. In order to do this, we can perform a lemma search. This is also done using the criteria menu. First of all, we need to enter the dictionary form of the verb into the first word box. Then we find and click the lemma criterion under word in the criteria menu, cf. figure 3. It will pop up in a new white field under the search field itself, in the same way as the criteria verb and pres described above (to remove the other criteria, simply double click them). We are now ready to perform a new search with a query consisting of the lemma "skulle", giving us all forms of this verb, followed by the word form "ikke".

Note that only Norwegian and Danish transcriptions are lemmatized for now, so using lemma based queries in any of the other languages unfortunately won't work.

Figure 3 - Selecting criteria: word - lemma

Back to top.

Example 3 - Searching for Compounds Using Parts of Words

A practical way to search for compounds is using the different parts of words criteria. For this example, let's say we're interested in finding Norwegian compounds in which one of the elements is "gutt" ("boy"). In compounds with two elements, "gutt" may then be initial or final, and if we find compounds with more than two elements, it may even be medial.

First, we'll see how we can find compounds where "gutt" is the initial element. We start out by entering gutt into the word box in the linguistic search field (cf. example 1). Then we open the criteria menu and select word, then start of word. We now have a query that will retrieve all words starting with "gutt", as well as all forms of the word "gutt" itself. Since we are interested only in compounds, we want to filter out the occurrences of the lemma "gutt" (i.e. the word forms "gutt, gutten, gutter, guttene"). Since the Norwegian transcriptions are lemmatized, this is easily done by excluding the lemma "gutt" from the query using the add negated lemma criterion (also under word). The linguistic search field will now look like the one in figure 4a below. If we were searching for compounds in one of the languages that aren't lemmatized, we would have to exclude every possible word form of the lemma in question using the add negated word form criterion to achieve the same result.

The same method can be used to find compounds where "gutt" is the medial or final element, but here we need to perform two searches to be sure that we find all possible hits. First, to find compounds where "gutt" is the final element (in its indefinite singular form), we simply perform a search identical to the one described over, only using the end of word criterion instead of the start of word criterion. For this search, the linguistic search field will look like the one in figure 4b below. Second, to find compounds where "gutt" in a medial element, or compounds where "gutt" is final and in any other form than the indefinite singular ("-gutten, -gutter, -guttene"), we can drop using the add negated lemma criterion and only make use of the within word criterion. For this search, the linguistic search field will look like the one in figure 4c below.

Knowing some basic regular expressions gives you a lot more flexibility when searching for parts of words. You can read more about regular expressions in the full user manual (coming)
.

Figure 4 - Searching for compounds where one of the elements is "gutt"

Back to top.

Example 4 - Searching for Semi-Phonetic Alternants of Specific Words

In example 1, we used the orthographic word form of the Norwegian first person pronoun "jeg" to search for phonetic variation in Norwegian dialects. In this example, we'll see that it's also possible to go the other way around. We can in other words search for known semi-phonetic alternants of words to find out how frequent they are or what their geographical distribution might be. This is only possible for Norwegian dialects, since these are the only that are transcribed both orthographically and semi-phonetically. Read more about the transcription of the Norwegian recordings here, and read the transcription guidelines here (especially section 3.4, Norwegian only).

For this example, we're interested in finding the distribution for a typical South Eastern Norwegian variant of the negative adverb "ikke". The form may vary in the semi-phonetic transcription, depending on vowel quality and whether we have the full form or the clitic, but common to all these forms is that they end in "nte".

In order to perform this search then, we need to enter the query "nte" into the word box in the linguistic search field, just like we would have if we were searching for an orthographic word form. But now we also have to specify that the search is to be performed in the semi-phonetic transcriptions. To do this, we have to select the phonetic criterion from the criteria menu. We also need to select the the end of word criterion that we find in the word subsection of the criteria menu in order to specify that our query consists of an ending common to several possible word forms. Finally, we want to specify that what we're looking for is alternants of the orthographic word form "ikke". This is done by entering "ikke" in the pop-up window that appears when we select the criterion word - add word form. If we drop this last step, our query will return hits of all words ending in "nte" in the semi-phonetic transcription.

In figure 5 below, we see the linguistic search field prepared for this example search. Note that it's also possible to combine the phonetic criterion with other criteria, like e.g. parts of speech.

Figure 5 - Searching part of a phonetic alternant of a specific word form

Back to top.

A Quick Intro to Dealing with Search Results

The search results will always open in a new window like the one showed Figure 6 below. Here we've searched for "jeg", the first person pronoun in Norwegian dialects, like in example 1 above. For this section though, the maximum number of hits is set to 200 and the hits are randomized. At the top of the page, you'll find the total number of informants included in the search (505). A bit further down, you'll see the number of hits available to you and how many hits there are in total (200 and 52 630). The hits are presented 20 per page by default, but this is adjustable (cf. the full user manual - coming soon). Each hit is presented in the form of a segment containing the word (or phrase) searched for, which in turn is bolded out. The code of the informant having uttered the segment is in the left column, and here you also find three clickable buttons. The first one () opens information about the informant in a new window. The two following buttons ( and ) open video and audio or just audio for the current segment on top of the results page. As is shown in Figure 6, not all recordings include video, and in such cases there will just be an audio button available. The media player that opens when clicking the either of these buttons, is shown in figure 7. By default this is a Flash player, but if there are problems using Flash, a QuickTime player may also be used. The segment currently playing is highlighted in orange on the right hand side. It is possible to get a wider context to the right and/or to the left by using the sliders directly above the play button. In Figure 7, the left context has been adjusted to include the three preceding segments.

Figure 6 - Search results page

Figure 7 - Media player

There are several more advanced functions useful for handling the search results available through the action drop down menu. Among other things, it's possible to count results, save results for later reference and export results to various editable formats. These functions aren't dealt with here, but described in detail in the full user manual (coming soon). We will, on the other hand, show how to use the map function. Basically, this function shows your hits on a map. When the search is done in Norwegian transcriptions (that are transcribed both orthographically and semi-phonetically), you are also shown phonetic variation (if you search for a multi-word phrase, phonetic variation is only shown for the first word). The goal of example 1 was to find variation in pronunciation of the first person pronoun in Norwegian dialects. If we use the 200 hits from the result page shown in Figure 6 above as a point of departure, we can then click the map button to get a simple overview shown in Figure 8 below. The recording locations from which the search has returned hits are shown as red dots on the map, and the different phonetic realizations are shown in a list to the right. By selecting a color each phonetic realization may be marked out on the map. To illustrate this, all forms that have undergone vowel breaking are marked out with red in the figure.

Figure 8 - Map over semi-phonetic "jeg" realizations

Back to top.