In compiling a language
corpus, three issues should be considered beforehand: the reliability of the
corpus, its representativeness and its validity. Corpus reliability depends
directly on its size, and its representativeness on the type of
material included, while its validity is a byproduct of these two. It should
also be noted that corpus reliability depends on the aspect of
language under investigation.
There were two principal sampling criteria in
building the corpus of the Serbian language. The first criterion was that
the corpus should include all relevant periods in the development of the Serbian
language and encompass all relevant genres of written Serbian.
The second criterion concerned the overall size of the Corpus and the
size of its subsamples. Inspection of the documentation suggests that
sampling constituted an important part of the project that was approached
with the utmost care and seriousness. The existence of several studies
on sample size and sample reliability (i.e., corpus size and its reliability),
written by the most prominent statisticians of that time (B. Ivanović and
B. Bajšanski), indicates that the sample segments and their sizes were not chosen
arbitrarily. The original studies have not yet been found, although we know
their titles. Likewise, inspection of the authors and books that constitute
the subsamples of the Serbian language from the 12th to the 20th century suggests
clear sampling criteria, which are elaborated in more detail in the following
paragraphs.