In compiling a language corpus, three issues should be considered beforehand:
the reliability of the corpus, its representativeness and its validity. Corpus
reliability depends directly on corpus size, representativeness depends on
the type of material included, and validity is a byproduct of these
two factors. It should also be noted that corpus reliability depends on
the aspect of language under investigation.
There were two principal sampling criteria in building the corpus of the
Serbian language. The first was that the corpus should include all relevant
periods in the development of the Serbian language and encompass all relevant
genres of written Serbian. The second concerned the overall size of the
corpus and the size of its subsamples. Inspection of the documentation
suggests that sampling constituted an important part of the project and
was approached with the utmost care. The fact that several studies on
sample size and sample reliability (i.e. corpus size and its reliability)
were written by the most prominent statisticians of that time (B. Ivanović
and B. Bajšanski) indicates that the sample segments and their sizes were
not chosen at random. Thus far these original studies have not been found,
although we know their titles. Likewise, inspection of the authors and books
that constitute the subsamples of the Serbian language from the 12th to the
20th century suggests clear sampling criteria, which are elaborated in more
detail in the following paragraphs.
A. Criteria for determining the size of the sample and the subsamples
1. General considerations:
The minimal (or optimal) corpus size that assures reliability is an
empirical rather than an intuitive matter. It could be argued that
reliability with respect to corpus size depends heavily on the aspects of
language under investigation. It is far from clear that the same corpus
size is required to provide a reliable approximation of the probability
distribution of phonemes (graphemes), for example, as opposed to, say,
lexical variation. However, to our knowledge there are no systematic
statistical studies that suggest an optimal corpus size for a particular
aspect of language. As a consequence, there are no clear empirical criteria
for the size required to assure corpus reliability.
With this in mind, it is not possible to say whether a corpus of, say,
100,000,000 items is reliable or not; what is usually offered as an argument
is simple intuition. By the same token, it is not possible to say whether
a corpus of 11,000,000 items (the size of the CSL) is sufficient to provide
reliability. Our intention is to carry out a systematic statistical
investigation of the CSL in the near future and to establish quantitative
norms for the stability of probabilities for different aspects of language
as a function of corpus size.
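One way such an investigation could be set up is sketched below. This is only
an illustration under our own assumptions, not the method of the original
studies: it compares the relative frequencies estimated from growing prefixes
of a sample with those of the whole sample, using total variation distance as
the (assumed) stability measure. The function names and the toy text are
hypothetical.

```python
from collections import Counter


def relative_frequencies(items):
    """Map each distinct item to its relative frequency in `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def stability_curve(items, sizes):
    """Distance between each prefix sample and the full sample.

    A curve that flattens near zero suggests that the estimated
    probabilities have stabilised at that sample size for the chosen unit.
    """
    reference = relative_frequencies(items)
    return [
        (n, total_variation(relative_frequencies(items[:n]), reference))
        for n in sizes
        if 0 < n <= len(items)
    ]


# Toy data standing in for a real corpus (hypothetical, illustration only).
text = "ana voli mleko a milan voli vodu i ana voli vodu " * 50
graphemes = [ch for ch in text if ch != " "]
words = text.split()

# The same procedure applied to two different aspects of language.
grapheme_curve = stability_curve(graphemes, [10, 100, 1000, len(graphemes)])
word_curve = stability_curve(words, [10, 100, len(words)])
```

Running the same procedure separately for graphemes and for words makes it
possible to compare how quickly each aspect stabilises, which is the kind of
comparison the argument above calls for.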
2. Why the CSL has 11,000,000 words:
At this point we do not know why the Corpus is of the size it is. What we
know is that the size of the corpus and its subsamples was not determined
arbitrarily and was a matter of serious study by the two most prominent
statisticians in Yugoslavia in the mid-1950s.
The size of each subsample for the period up to the 20th century ranges
from half a million to more than one and a half million items. Thus, for
example, each of the subsamples of old Serbian literature (12th-17th and
18th century) has approximately half a million words. The size of the
subsample of the complete works of Vuk St. Karadžić was determined by the
amount of published material (about 1,700,000 words), while the subsample
that covers the second part of the 19th century contains about 1,300,000
words.
The contemporary language sample contains about 7,000,000 words. It is
interesting that its subsamples are of approximately the same size, about
1,400,000 words each. As noted, at this point it is not clear which criterion
was used to determine the subsample size, although this may be clarified when
the studies concerning sample size are found or when we carry out statistical
research on corpus reliability.
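As a quick consistency check, the approximate figures quoted above can be
added up. The short sketch below assumes our reading that old Serbian
literature forms two subsamples of about half a million words each, and takes
the five contemporary genres from the list given later in the text; all
numbers are the rounded values from the text, not exact counts.

```python
# Approximate subsample sizes in words, as quoted in the text (rounded).
historical = {
    "12th-17th century": 500_000,
    "18th century": 500_000,
    "Vuk St. Karadžić, complete works": 1_700_000,
    "second part of the 19th century": 1_300_000,
}

# Five contemporary genres of roughly 1,400,000 words each.
contemporary = {
    genre: 1_400_000
    for genre in (
        "prose",
        "daily press",
        "scientific literature",
        "poetry",
        "political texts",
    )
}

historical_total = sum(historical.values())
contemporary_total = sum(contemporary.values())
corpus_total = historical_total + contemporary_total

print(historical_total, contemporary_total, corpus_total)
# prints: 4000000 7000000 11000000
```

Under these assumptions the subsample sizes are consistent with the stated
totals of about 7,000,000 words for the contemporary language and about
11,000,000 words for the corpus as a whole.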
B. Criteria for the choice of periods and authors
1. Criteria for diachronic sampling:
Given that the corpus is diachronic, two considerations are relevant: a)
which historical periods should be included, and b) which segments (genres)
should be considered representative of the contemporary Serbian language.
Scholars dealing with old Serbian literature agree that there are three
distinct periods in the development of the Serbian written language: a) the
period from the 12th century to the end of the 17th century, which is
characterized by the Serbian-Slavonic language; b) the period from the 18th
century to the first part of the 19th century, when radical reforms were
introduced by Vuk St. Karadžić; and c) the second half of the 19th century,
when Karadžić’s reforms prevailed and linguistic standards, in both written
and spoken language, became generally accepted.
The part of the Corpus that encompasses the Serbian language up to the 20th
century is divided into four distinct subsamples. The first subsample
encompasses the period between the 12th and 18th centuries and includes two
distinct types of material: a) the lives of Serbian saints, which constitute
a distinct genre written according to specified rules and in this respect
may be considered typical literary texts of that period, and b) old Serbian
charters and letters, which are closer to everyday language. By including
these two types of material in the sample, both literary and popular
(national, i.e. spoken by ordinary people) language are represented, thus
covering all relevant forms of the Serbian language between the 12th and
18th centuries.
The second subsample covers the language from the end of the 17th century
to the reforms introduced by Vuk St. Karadžić. This period is characterized
by a dramatic absence of linguistic and orthographic standards and by various
influences that were not treated systematically. As a consequence, authors
from that period used somewhat idiosyncratic orthography, vocabulary and
grammar. The included authors represent all forms of this variation in
the usage of the Serbian language, making the whole subsample representative
of the respective period.
The complete works of Vuk St. Karadžić form a distinct part of the sample
of the Serbian language up to the 20th century. There are several reasons
why Karadžić has been included in full. The first and most important
reason is that Karadžić introduced radical reforms in both Serbian
orthography and linguistic standards.
The work of Karadžić is a turning point in the development of the Serbian
written and spoken language. However, Karadžić was not only a reformer
of the Serbian language. He also collected Serbian national poems, proverbs
and stories, translated the New Testament into Serbian, compiled the first
Serbian language dictionary, wrote the first primer and the first Serbian
language grammar, wrote a number of linguistic, ethnological, geographical
and historical studies, and had extensive correspondence with the most
prominent people in Europe of that time. Thus, Karadžić's complete works
encompass various aspects of the Serbian language, ranging from Serbian
national poetry and proverbs to his personal correspondence. This allows
for a number of comparisons between the different historical segments of
the Serbian national language, on the one hand, and the language of Vuk
St. Karadžić himself, on the other. Likewise, this subsample allows for
detailed tracing of the changes consequent upon Karadžić’s reforms.
The fourth subsample
covers the language of the second part of the 19th century and includes
authors who adopted Karadžić’s reforms. This subsample includes the
complete works of Branko Radičević, Marko Miljanov, Đura Jakšić, Petar
Petrović-Njegoš and Jovan Jovanović-Zmaj, and one essay by Laza Kostić.
These six authors are not only among the most prominent figures in Serbian
literature; they also cover all genres of 19th-century Serbian literature.
Thus, for example, Branko Radičević was one of the first poets to adopt
Karadžić’s reforms, while the writings of Marko Miljanov resemble the
spoken language of the end of the 19th century. The complete works of
Njegoš represent a specific subsample because, in addition to “Gorski
Vijenac” and “Luča Mikrokozma”, two of the most prominent works written
in the Serbian language, they include his personal correspondence. Đura
Jakšić made significant contributions in different literary genres, thus
allowing for their comparison within a single author. This is to some
extent also true of Jovan Jovanović-Zmaj. The writing of Laza Kostić
included in the corpus is representative of the literary criticism of that
period.
In addition to these four periods in the development of the Serbian language
there was one more subsample which, unfortunately, was not grammatically
tagged and was therefore not transferred into electronic format. This
subsample included Dubrovnik literature from the 16th and 17th centuries.
b. Sampling criteria for the contemporary language:
The contemporary Serbian language is represented by five distinct subsamples
(prose, daily press, scientific literature, poetry, and political texts),
each of them a distinct genre of written language. It is hard to think of
an additional genre of written language that would enhance the
representativeness of the sample. All items included in the corpus (with
few exceptions) were written between 1945 and 1957. It could be argued that
this challenges the status of the material as representative of the
contemporary language. This issue will be discussed later.
The subsample of prose includes novels, essays, literary criticism
and polemics. The daily press subsample consists of Belgrade’s daily
newspaper “Politika”, which was considered to be a broadsheet of the highest
language standards. This subsample was divided into three distinct periods
that allow for the statistical investigation of stable and variable aspects
of language across a somewhat restricted time span of 12 years. Scientific
literature encompasses a number of scientific disciplines, enabling an
insight into the idiosyncrasies of language use across different scientific
fields, on the one hand, and into the contrast between the language of
scientific literature and other domains of written language, on the other.
The motivation for including poetical works is that poetic vocabulary is
often richer than that encountered in other genres. Finally, political texts
were included because it was believed that they have distinct properties
that are uncommon in other genres.
It should be emphasized that the criterion for including a particular
item in the corpus was not its literary or scientific value.
c. Is the CSL outdated?
It could be argued that the sample of the contemporary language is not
reliable because the selected items are almost half a century old. Again,
this claim requires empirical evaluation. At this point we do not know what
changes took place in the course of the last few decades. Moreover, the
subsamples of contemporary language differ with respect to potential changes
over time. Intuitively, we can say that the language of prose and poetry
did not change much. In contrast, the language of the daily press seems to
be subject to greater changes, in particular the personal names, places and
terms specific to a particular period. Likewise, the vocabulary of scientific
literature and political texts has also changed over time. However, the claim
that the CSL represents an outdated sample of contemporary language relies
on pure intuition, and it is a matter of statistical evaluation to find out
what percentage of the vocabulary is stable and how this stability depends
on the type of material. Only then could it be argued that some parts of
the Corpus are outdated.
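The kind of evaluation meant here could, for instance, start from a measure
as simple as the one sketched below: the percentage of distinct word forms
in an older sample that still occur in a newer sample of the same genre.
This is a hypothetical illustration under our own assumptions, not a
procedure that has been applied to the CSL; the function name and the tiny
toy samples are invented, and a real study would need lemmatised text and
samples of comparable size.

```python
def vocabulary_stability(older_tokens, newer_tokens):
    """Percentage of distinct forms in the older sample found in the newer one."""
    older = set(older_tokens)
    newer = set(newer_tokens)
    if not older:
        raise ValueError("older sample is empty")
    return 100.0 * len(older & newer) / len(older)


# Invented toy samples (older, newer) per genre, for illustration only.
samples = {
    "poetry": (
        "noć je tiha i duga".split(),
        "noć je duga i hladna".split(),
    ),
    "daily press": (
        "savezna skupština usvojila plan".split(),
        "skupština usvojila budžet i zakon".split(),
    ),
}

# One stability figure per type of material.
stability_by_genre = {
    genre: vocabulary_stability(older, newer)
    for genre, (older, newer) in samples.items()
}
```

Computing the figure separately for each genre is what would show whether
stability depends on the type of material, as suggested above.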