Criteria

In compiling a language corpus, three issues should be considered beforehand: the reliability of the corpus, its representativeness and its validity. Reliability depends directly on corpus size, representativeness on the type of material included, while validity is a byproduct of these two factors. It should also be noted that corpus reliability depends on the aspect of language under investigation.

1. General considerations: The minimal (or optimal) size of a corpus that will ensure its reliability is an empirical rather than an intuitive matter. It could be argued that reliability with respect to corpus size depends heavily on the aspects of language being investigated. It is far from clear that the same corpus size is required to provide a reliable approximation of the probability distribution of phonemes (graphemes), for example, as opposed to, say, lexical variation. However, to our knowledge there are no systematic statistical studies that suggest an optimal corpus size for a particular aspect of language. As a consequence, there are no clear empirical criteria for the size required to ensure corpus reliability.
With this in mind, it is not possible to say whether a corpus of, say, 100,000,000 items is reliable or not. What is usually offered as an argument is a simple intuition. By the same token, it is not possible to say whether a corpus of 11,000,000 items (the size of the CSL) is sufficient to provide reliability. Our intention is to make a systematic statistical investigation of the CSL in the near future and to establish quantitative norms for the stability of probabilities for different aspects of language as a function of corpus size.
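One way such an investigation could be approached is sketched below. It is an illustration only and not the procedure planned for the CSL: the distance measure (total variation distance), the function names and the toy data are all assumptions made for the example. The sketch compares the relative frequency distribution of progressively larger samples with that of the full sample.

```python
from collections import Counter


def relative_frequencies(items):
    """Return a dict mapping each item to its relative frequency."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def stability_curve(items, sizes):
    """Distance between each prefix sample and the full sample.

    `items` is a sequence of tokens; `sizes` lists the prefix lengths to test.
    """
    reference = relative_frequencies(items)
    return [
        (n, total_variation_distance(relative_frequencies(items[:n]), reference))
        for n in sizes
    ]


# Toy data only (hypothetical): a real study would use graphemes or
# word forms drawn from the corpus, ideally sampled at random.
toy_items = list("abracadabra" * 50)
for size, distance in stability_curve(toy_items, [10, 100, len(toy_items)]):
    print(size, round(distance, 4))
```

The sample size at which the distance falls below a chosen threshold would differ between graphemes and word forms, which is the sense in which reliability depends on the aspect of language investigated. Taking prefixes of the text, as done here for brevity, is not random sampling; a proper study would draw repeated random samples at each size.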

2. Why the CSL has 11,000,000 words: At this point we don’t know why the Corpus is of the size it is. What we do know is that the size of the corpus and its subsamples was not determined arbitrarily; it was a matter of serious study by the two most prominent statisticians in Yugoslavia in the mid-1950s.
The size of each subsample for the period up to the 20th century varies from half a million to more than one-and-a-half million items. Thus, for example, each of the subsamples of old Serbian literature (12th-17th and 18th century) has approximately half a million words. The size of the subsample of the complete works of Vuk St. Karadžić was determined by the amount of published material (about 1,700,000 words), while the subsample that covers the second part of the 19th century contains about 1,300,000 words.
The contemporary language part contains about 7,000,000 words. It is interesting that its subsamples are of approximately the same size, about 1,400,000 words each. As noted, at this point it is not clear which criterion was used to determine the subsample size, although this may be clarified when the studies concerning the sample size are found or when we carry out statistical research on corpus reliability.
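The approximate figures quoted above can be checked against the stated total with simple arithmetic. The short sketch below does only that; the dictionary keys are labels chosen for the example, and all numbers are the rounded values given in the text, in thousands of words.

```python
# Approximate subsample sizes in thousands of words, as quoted in the text.
historical = {
    "12th-17th century": 500,
    "18th century": 500,
    "Vuk St. Karadžić, complete works": 1700,
    "second part of the 19th century": 1300,
}
contemporary_total = 7000
contemporary_subsample = 1400

historical_total = sum(historical.values())
corpus_total = historical_total + contemporary_total
implied_contemporary_subsamples = contemporary_total // contemporary_subsample

print(historical_total, corpus_total, implied_contemporary_subsamples)
```

On these rounded figures the historical part amounts to about 4,000,000 words, which together with the contemporary part gives the 11,000,000 words of the CSL, and the contemporary part divides into five subsamples of about 1,400,000 words each.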

a. Criteria for diachronic sampling: Given that the corpus is diachronic, two considerations are relevant: a) which historical periods should be included, and b) which segments (genres) should be considered representative of the contemporary Serbian language.
Scholars dealing with old Serbian literature agree that there are three distinct periods in the development of the Serbian written language: a) the period from the 12th century to the end of the 17th century, which is characterized by the Serbian-Slavonic language; b) the period from the 18th century to the first part of the 19th century, when radical reforms were introduced by Vuk St. Karadžić; and c) the second half of the 19th century, when Karadžić’s reforms prevailed and linguistic standards, in both written and spoken language, became generally accepted.
The part of the Corpus that encompasses the Serbian language up to the 20th century is divided into four distinct subsamples. The first subsample covers the period between the 12th and 18th centuries and includes two distinct types of material: a) the lives of Serbian saints, which constitute a distinct genre written according to specified rules and may in this respect be considered typical literary texts of that period, and b) old Serbian charters and letters, which are closer to everyday language. By including these two types of material in the sample, both literary and popular (national, i.e. spoken by ordinary people) language are represented, thus covering all relevant forms of the Serbian language between the 12th and 18th centuries.
The second subsample covers the language from the end of the 17th century to the reforms introduced by Vuk St. Karadžić. This period is characterized by a dramatic absence of linguistic and orthographic standards and by various influences that were not treated systematically. As a consequence, authors from that period used somewhat idiosyncratic orthography, vocabulary and grammar. The included authors represent all forms of this variation in the usage of the Serbian language, making the whole subsample representative of the period.
A distinct part of the sample of the Serbian language up to the 20th century is the complete works of Vuk St. Karadžić. There are several reasons why Karadžić has been included in full. The first and most important reason is that Karadžić introduced radical reforms in both Serbian orthography and linguistic standards.
The work of Karadžić is a turning point in the development of Serbian written and spoken language. However, Karadžić was not only a reformer of the Serbian language. He also collected Serbian national poems, proverbs and stories, translated the New Testament into Serbian, compiled the first Serbian dictionary, wrote the first primer and the first Serbian grammar, wrote a number of linguistic, ethnological, geographical and historical studies, and corresponded extensively with the most prominent people in Europe of that time. Thus, Karadžić's complete works encompass various aspects of the Serbian language, ranging from Serbian national poetry and proverbs to his personal correspondence. This allows for a number of comparisons involving, on the one hand, the different historical segments of the Serbian national language and, on the other hand, the language of Vuk St. Karadžić himself. Likewise, this subsample allows for detailed tracing of the changes that followed Karadžić’s reforms.
The fourth subsample covers the language of the second part of the 19th century and includes authors who adopted Karadžić’s reforms. This subsample includes the complete works of Branko Radičević, Marko Miljanov, Đura Jakšić, Petar Petrović-Njegoš and Jovan Jovanović-Zmaj, and one essay by Laza Kostić. These six authors are not only among the most prominent figures in Serbian literature; they also cover all genres of 19th century Serbian literature. Thus, for example, Branko Radičević was one of the first poets to adopt Karadžić’s reforms, while the writings of Marko Miljanov resemble the spoken language of the end of the 19th century. The complete works of Njegoš represent a specific subsample because, in addition to “Gorski Vijenac” and “Luča Mikrokozma”, two of the most prominent works written in the Serbian language, they include his personal correspondence. Đura Jakšić made significant contributions within different literary genres, thus allowing for their comparison within a single author. This is to some extent also true of Jovan Jovanović-Zmaj. The writings of Laza Kostić included in the corpus are representative of the literary criticism of that period.
In addition to these four periods in the development of the Serbian language, there was one more subsample which, unfortunately, was not grammatically tagged and was therefore not transferred into electronic format. This subsample included Dubrovnik literature from the 16th and 17th centuries.

b. Sampling criteria for the contemporary language: The contemporary Serbian language is represented by five distinct subsamples (prose, daily press, scientific literature, poetry, and political texts), each of them a distinct genre of written language. It is hard to think of an additional genre of written language that would enhance the representativeness of the sample. All items included in the corpus (with few exceptions) were written between 1945 and 1957. It could be argued that this calls into question whether the material is representative of the contemporary language. This issue will be discussed later.
The subsample of prose includes novels, essays, literary criticism and polemics. The daily press subsample consists of the Belgrade daily newspaper “Politika”, which was considered a broadsheet of the highest language standards. This subsample was divided into three distinct periods, which allow for the statistical investigation of stable and variable aspects of language across a somewhat restricted time span of 12 years. Scientific literature encompasses a number of scientific disciplines, giving an insight into the idiosyncrasies of language use across different scientific fields, on the one hand, and into the contrast between the language of scientific literature and that of other domains of written language, on the other. The motivation for including poetical works is that poetic vocabulary is often richer than that encountered in other genres. Finally, political texts were included because it was believed that they have distinct properties that are uncommon in other genres.
It should be emphasized that the criterion for including a particular item in the corpus was not its literary or scientific value.

c. Is the CSL outdated?: It could be argued that the sample of the contemporary language is not reliable because the selected items are almost half a century old. Again, this claim requires empirical evaluation. At this point we do not know what changes took place in the course of the last few decades. Moreover, the subsamples of contemporary language are likely to differ in how much they have changed over time. Intuitively, we can say that the language of prose and poetry did not change much. In contrast, the language of the daily press seems to be subject to greater changes, in particular personal names, place names and terms specific to a particular period. Likewise, the vocabulary of scientific literature and political texts has also changed over time. However, the claim that the CSL represents an outdated sample of contemporary language relies on pure intuition, and it is a matter of statistical evaluation to find out what percentage of the vocabulary is stable and how this stability depends on the type of material. Only then could it be argued that some parts of the Corpus are outdated.
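The kind of evaluation called for above could, in its simplest form, measure the share of distinct word forms in an older sample that also occur in a more recent sample of the same genre. The sketch below is a minimal illustration under strong assumptions: the function names are invented for the example, the toy tokens are not taken from the CSL, and a recent comparison sample is assumed to exist. A real study would also need to control for sample size and decide whether to compare word forms or lemmas.

```python
def stable_share(older_tokens, newer_tokens):
    """Fraction of distinct word forms in the older sample also found in the newer one."""
    older_vocab = set(older_tokens)
    newer_vocab = set(newer_tokens)
    if not older_vocab:
        raise ValueError("older sample is empty")
    return len(older_vocab & newer_vocab) / len(older_vocab)


def stability_by_genre(older, newer):
    """Compute the stable share for every genre present in both samples."""
    return {
        genre: stable_share(older[genre], newer[genre])
        for genre in older
        if genre in newer
    }


# Invented toy tokens (hypothetical), used only to show the calculation.
older = {
    "poetry": ["sun", "river", "night", "song"],
    "daily press": ["council", "plan", "quota", "river"],
}
newer = {
    "poetry": ["sun", "river", "night", "dream"],
    "daily press": ["council", "market", "internet", "river"],
}
print(stability_by_genre(older, newer))
```

Computing the share separately for each genre keeps the question of how stability depends on the type of material open to direct comparison, rather than folding all subsamples into a single figure.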