Corpus of Serbian Language (CSL)
The CSL consists of: a. grammatically annotated text, b. a series of frequency dictionaries
and c. a set of probability tables for grammatical forms, phonemes and
phonemic combinations.
a. Grammatically annotated text: The text is given in its original form, with each word tagged for its grammatical status, number of phonemes, phonological structure and onomastic status. In addition, punctuation and the beginning and end of each sentence and paragraph are also annotated.
b. Frequency dictionaries:
For each sub sample, a series of frequency dictionaries has been compiled.
Thus, for example, for the contemporary Serbian language frequency dictionaries
are available at the level of: a. book, b. author, c. sub sample (e.g.
poetry or daily press) and d. sub sample as a whole (e.g. contemporary
language or language between the XII and XVII centuries). In addition to the probability
of an entry, frequency dictionaries also contain the probabilities of grammatical
forms for a given entry, the number of graphemes and the phonological structure
for each word.
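
The kind of information a frequency dictionary entry holds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token list, the tag strings and the function name `frequency_dictionary` are hypothetical and do not reflect the actual CSL tag set or file format.

```python
from collections import Counter, defaultdict

# Hypothetical annotated tokens as (entry, grammatical form) pairs.
# The tags are invented for illustration; they are not the CSL tag set.
tokens = [
    ("kuća", "noun.nom.sg"),
    ("kuća", "noun.gen.sg"),
    ("kuća", "noun.nom.sg"),
    ("ići", "verb.pres.3sg"),
]


def frequency_dictionary(tokens):
    """Return entry probabilities and, per entry, form probabilities."""
    entry_counts = Counter(entry for entry, _ in tokens)
    form_counts = defaultdict(Counter)
    for entry, form in tokens:
        form_counts[entry][form] += 1

    total = sum(entry_counts.values())
    result = {}
    for entry, count in entry_counts.items():
        result[entry] = {
            # Probability of the entry within the sample.
            "probability": count / total,
            # Probability of each grammatical form, given the entry.
            "forms": {f: c / count for f, c in form_counts[entry].items()},
            # Number of graphemes, taken here as the character count.
            "graphemes": len(entry),
        }
    return result


freq = frequency_dictionary(tokens)
```

Running the same function over the tokens of one book, one author or a whole sub sample would give dictionaries at the corresponding levels.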
c. Probability matrices: The CSL contains probability matrices of all grammatical forms of the Serbian language, as well as probability matrices of phonemes and phonemic combinations. These matrices are given at all levels of potential analysis – from the level of a book to the level of a sub sample (e.g. contemporary language). The material is given in a format that is easily transferable into any standard statistical package.
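
As a rough illustration of a probability matrix of phonemic combinations, the sketch below counts adjacent phoneme pairs in a small word list and writes the result as CSV, a format most statistical packages can read. The word list and the function names are hypothetical, and the code treats each character as one phoneme, which is a simplification: Latin-script digraphs such as lj, nj and dž would need separate handling. It is not the procedure used to build the CSL matrices.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of words; one character is treated as one phoneme.
words = ["kuća", "ruka", "kuka"]


def phoneme_pair_matrix(words):
    """Return the phoneme inventory and a matrix of pair probabilities."""
    pair_counts = Counter()
    for word in words:
        # Count every pair of adjacent phonemes within a word.
        for first, second in zip(word, word[1:]):
            pair_counts[(first, second)] += 1

    total = sum(pair_counts.values())
    symbols = sorted({p for pair in pair_counts for p in pair})
    # Rows are first phonemes, columns are second phonemes.
    matrix = [[pair_counts[(a, b)] / total for b in symbols] for a in symbols]
    return symbols, matrix


def to_csv(symbols, matrix):
    """Serialize the matrix as CSV text with row and column labels."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([""] + symbols)
    for symbol, row in zip(symbols, matrix):
        writer.writerow([symbol] + row)
    return buffer.getvalue()


symbols, matrix = phoneme_pair_matrix(words)
csv_text = to_csv(symbols, matrix)
```

Each cell holds the share of all observed pairs made up by that combination, so the cells of the whole matrix sum to one.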
At this moment the following is available: grammatically annotated text for all sub samples, frequency dictionaries at all levels for the contemporary Serbian language and probability matrices for grammatical forms and phonological structure at all levels for the contemporary Serbian language. Work on frequency dictionaries and probability matrices for other sub samples is in progress.