masht

THE BELGRADE PROJECT OF MACHINE TRANSLATION CSL

Inside view of the speech synthesizer constructed in 1957 at the Institute for Experimental Phonetics and Speech Pathology, Belgrade		In the middle of 1950s, the project for machine translation was commenced at the Institute for Experimental Phonetics and Speech Pathology in Belgrade. The project included automatization of several elements of speech communication - speech recognition, automatic recording of speech into textual format, a language corpus, translation from one language into another and speech synthesis. Professor Kostić, who directed this project was of the opinion that complete automatic translation from one language into the other was not possible because language could not be totally formalized. Further, due contextual and extralinguistic factors, the machine would have to introduce real world knowledge in order to interpret the context correctly. For these reasons, the problem of automatic speech recognition was constructed on a probabilistic basis and formalizations were maximized. With regards to automatic speech recognition, in
addition to knowledge about distinctive properties of phonemes, the computer memory would need knowledge of probabilities of phonemic co-occurrences. Automatic text recognition relied on the probability of grammatical forms, formalized morphology and syntax and lexemes and their grammatical forms. All of these considerations lead to the conclusion that the problem of machine translation was limited to a rough rendition, with the idea of improving the possibilities of automaticity by further research and technological advancement. A group of experts of different profiles was formed by contacting several institutions, such as the Faculty of Electrical Engineering, the Federal Bureau of Statistics and the Institute for Serbian Language at the Academy of Sciences and in 1954 the project was inaugurated. The acoustic structure of all phonemes in Serbian language was spectrographically described in detail. Machine for speech recognition was constructed and connected to a phonetic typewriter. This system was capable of recognizing and reproducing all Serbian vowels and a few consonants. If the speech sounds were pronounced in isolation and clearly, the machine was able to recognize them without any mistake, while in everyday speech the error rate was about 30%. A speech synthesizer that could produce all Serbian vowels, a few consonants and make several sentences was also developed. Today, it is very difficult to reconstruct the thinking behind which the architecture of the machine translation was conceived. It is certain however, that the problem was approached from two directions – from linguistics and engineering science. The technological and engineering requirements were left to a team of engineers under the management of Prof. Rajko Tomović who, at that time, was one of the leading world experts in the field of computer science. Linguistic considerations proceeded by the formation of a grammatically annotated corpus of Serbian language that was to be the language basis for the machine. Besides word entry probabilities, the corpus would allow approximation of probability of all grammatical forms and probability of all grammatical forms for each word. Although the project was conceived as an interactive conjunction of several parts, each presenting a problem in its own respect, the essence of the whole project was the corpus. Alongside syntax, which was to be specified in the second phase of the project, the corpus presented machine knowledge about the language. Parallel to this work on the corpus of the Serbian language, texts in English, German and French were grammatically annotated as well, with the aim of making pilot corpora that would serve as the material for evaluation of the system of machine translation as a whole. The project was divided into two chronological phases. The first phase assumed the acoustic analysis of the phonemes of Serbian language and the specification of the probabilities of phonemes and phoneme combinations, including parallel work on the speech analyzer, phonetic typewriter and speech synthesizer. The main part of the first phase was the formation of the corpus and specification of probabilities of words and grammatical forms. The second phase, which commenced at the end of 1950s, comprised a description of the syntax of Serbian language that would be partly formalized and partly expressed in terms of probability. At the beginning of 1960s, the project terminated for reasons that were not scientific.