Katarina Ivanović:
Self-portrait (1841)
The Corpus of Serbian Language (CSL) consists of: a. grammatically
annotated text, b. a series of frequency dictionaries, and c. a set of
probability tables for grammatical forms, phonemes and phonemic
combinations.
a. Grammatically annotated text: The text is given in its original form,
with each word tagged for its grammatical status, number of phonemes,
phonological structure and onomastic status. In addition, punctuation and
the beginning and end of each sentence and paragraph are also annotated.
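As an illustration only, one annotated word could be represented by a record like the one below. The field names, the tag string and the consonant/vowel skeleton used for "phonological structure" are assumptions made for this sketch, not the actual CSL annotation scheme; the sketch also treats each letter as one phoneme, which ignores Latin-script digraphs such as lj, nj and dž.

```python
from dataclasses import dataclass

# Assumption: a five-vowel inventory; syllabic "r" is not handled here.
VOWELS = set("aeiou")


@dataclass
class AnnotatedToken:
    """Hypothetical record for one annotated word (not the real CSL format)."""
    form: str              # word as it appears in the original text
    grammatical_tag: str   # hypothetical tag string, e.g. "noun.nom.sg"
    phoneme_count: int     # number of phonemes in the word
    cv_structure: str      # assumed encoding of phonological structure
    is_proper_name: bool   # assumed encoding of onomastic status


def annotate(form, grammatical_tag, is_proper_name=False):
    # Simplification: one letter is treated as one phoneme.
    phonemes = list(form.lower())
    cv = "".join("V" if p in VOWELS else "C" for p in phonemes)
    return AnnotatedToken(form, grammatical_tag, len(phonemes), cv, is_proper_name)


token = annotate("Beograd", "noun.nom.sg", is_proper_name=True)
print(token)
```

Sentence and paragraph boundaries and punctuation would need their own marker records in such a scheme; they are left out here to keep the sketch short.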
b. Frequency dictionaries: For each sub sample, a series of frequency
dictionaries has been compiled. Thus, for example, for the contemporary
Serbian language, frequency dictionaries are available at the level of:
a. book, b. author, c. sub sample (e.g. poetry or daily press) and d. sub
sample as a whole (e.g. contemporary language or language between the XII
and XVII centuries). In addition to the probability of an entry, the
frequency dictionaries also contain the probabilities of the grammatical
forms of a given entry, and the number of graphemes and the phonological
structure of each word.
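A minimal sketch of how a frequency dictionary of this kind could be derived from annotated text follows. It assumes the tokens are already available as (entry, grammatical form) pairs; the function name, the tag strings and the sample data are hypothetical and are not taken from the CSL.

```python
from collections import Counter, defaultdict

# Hypothetical annotated tokens: (dictionary entry, grammatical form tag).
tokens = [
    ("kuća", "noun.nom.sg"),
    ("kuća", "noun.gen.sg"),
    ("kuća", "noun.nom.sg"),
    ("ići", "verb.pres.3sg"),
]


def frequency_dictionary(tokens):
    """Return per-entry frequency, probability and form probabilities."""
    entry_counts = Counter(entry for entry, _ in tokens)
    form_counts = defaultdict(Counter)
    for entry, form in tokens:
        form_counts[entry][form] += 1

    total = sum(entry_counts.values())
    return {
        entry: {
            "frequency": count,
            # Probability of the entry within this (sub) sample.
            "probability": count / total,
            # Probability of each grammatical form, given the entry.
            "forms": {f: c / count for f, c in form_counts[entry].items()},
        }
        for entry, count in entry_counts.items()
    }


freq = frequency_dictionary(tokens)
print(freq["kuća"])
```

Running the same function over the tokens of one book, one author or a whole sub sample would give dictionaries at the corresponding levels.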
c. Probability matrices: The CSL contains probability matrices of all
grammatical forms of the Serbian language, as well as probability matrices
of phonemes and phonemic combinations. These matrices are given at all
levels of potential analysis, from the level of a book to the level of a
sub sample (e.g. contemporary language). The material is given in a format
that is easily transferable into any standard statistical package.
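To make the idea of a probability matrix for phonemic combinations concrete, the sketch below counts adjacent phoneme pairs and normalises them into joint probabilities, then writes the matrix as CSV, a format most statistical packages can read. The input format (words already segmented into phonemes), the function names and the choice of CSV are assumptions for illustration; the actual CSL file format is not specified here.

```python
import csv
import io
from collections import Counter


def bigram_matrix(words):
    """Joint probability matrix of adjacent phoneme pairs.

    `words` is a list of words, each a list of phoneme symbols.
    """
    counts = Counter()
    for phonemes in words:
        counts.update(zip(phonemes, phonemes[1:]))
    total = sum(counts.values())
    symbols = sorted({p for pair in counts for p in pair})
    # Rows: first phoneme of the pair; columns: second phoneme.
    matrix = [[counts[(a, b)] / total for b in symbols] for a in symbols]
    return symbols, matrix


def to_csv(symbols, matrix):
    """Serialise the matrix with a header row and a label column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([""] + symbols)
    for symbol, row in zip(symbols, matrix):
        writer.writerow([symbol] + row)
    return buf.getvalue()


# Hypothetical input: two words, already segmented into phonemes.
words = [["m", "a", "m", "a"], ["t", "a", "t", "a"]]
symbols, matrix = bigram_matrix(words)
print(to_csv(symbols, matrix))
```

The same counting could be applied to grammatical form tags instead of phoneme pairs to obtain a matrix for grammatical forms.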
At present, the following are available: grammatically
annotated text for all sub samples, frequency dictionaries at all levels
for the contemporary Serbian language and probability matrices for grammatical
forms and phonological structure at all levels for the contemporary Serbian
language. Work on frequency dictionaries and probability matrices for other
sub samples is in progress.