About the creation of a parallel bilingual corpora of web-publications

D.V. Lande; V.V. Zhygalo

arXiv:0807.0311·cs.CL·July 3, 2008

Open Access

TL;DR

This paper presents an algorithm for creating a parallel bilingual corpus of web publications using key words and automated translation, resulting in a corpus of about 30,000 documents in Russian and Ukrainian.

Contribution

The paper introduces a novel algorithm for automated creation of bilingual corpora based on key words and morphological analysis, integrated into a content-monitoring system.

Findings

01

Created a bilingual corpus of 30,000 documents

02

Developed an algorithm using morphological dictionaries and statistical rules

03

Integrated the algorithm into an existing content-monitoring system

Abstract

An algorithm for creating parallel text corpora is presented. The algorithm is based on the use of "key words" in text documents and on means of translating them automatically. Key words were identified using Russian and Ukrainian morphological dictionaries, together with noun translation dictionaries for the Russian and Ukrainian languages. In addition, empirical-statistical rules were used to calculate the weights of the terms in the documents. The algorithm was implemented as a software package integrated into the InfoStream content-monitoring system. As a result, a parallel bilingual corpus of web publications containing about 30 thousand documents was created.
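The abstract does not spell out the matching procedure, so the following is only a minimal sketch of how keyword-based pairing of this kind could work, not the authors' implementation. It assumes documents are already lemmatized (the role the morphological dictionaries would play), uses plain term frequency as a stand-in for the paper's empirical-statistical weighting rules, and compares keyword sets with Jaccard overlap. The dictionary `RU_UK_NOUNS`, the function names, and the threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy Russian -> Ukrainian noun dictionary (lemma to lemma).
RU_UK_NOUNS = {
    "президент": "президент",
    "выборы": "вибори",
    "правительство": "уряд",
    "год": "рік",
}


def top_keywords(lemmas, vocabulary, k=3):
    """Return the k most frequent lemmas that occur in the noun vocabulary.

    Raw frequency is an assumption standing in for the paper's weighting rules.
    """
    counts = Counter(lemma for lemma in lemmas if lemma in vocabulary)
    return {lemma for lemma, _ in counts.most_common(k)}


def translate_keywords(keywords, dictionary):
    """Map source-language key words to the target language, dropping unknown ones."""
    return {dictionary[word] for word in keywords if word in dictionary}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def align(ru_docs, uk_docs, dictionary, threshold=0.5, k=3):
    """Pair each Russian document with its best-matching Ukrainian document."""
    uk_vocab = set(dictionary.values())
    uk_keys = {doc_id: top_keywords(lemmas, uk_vocab, k)
               for doc_id, lemmas in uk_docs.items()}
    pairs = []
    for ru_id, lemmas in ru_docs.items():
        translated = translate_keywords(top_keywords(lemmas, dictionary, k), dictionary)
        best_id, best_score = None, 0.0
        for uk_id, keys in uk_keys.items():
            score = jaccard(translated, keys)
            if score > best_score:
                best_id, best_score = uk_id, score
        # Keep only pairs whose keyword overlap reaches the threshold.
        if best_id is not None and best_score >= threshold:
            pairs.append((ru_id, best_id, best_score))
    return pairs


# Toy, already-lemmatized documents (illustrative data only).
ru_docs = {"ru1": ["выборы", "президент", "выборы"], "ru2": ["правительство", "год"]}
uk_docs = {"uk1": ["уряд", "рік", "уряд"], "uk2": ["вибори", "президент"]}
pairs = align(ru_docs, uk_docs, RU_UK_NOUNS)
```

On this toy input, `pairs` links `ru1` with `uk2` and `ru2` with `uk1`. A real system would need a full morphological analyzer and a weighting scheme tuned on actual web publications.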

Taxonomy

Topics: Lexicography and Language Studies · Literature, Language, and Rhetoric Studies · Linguistics and terminology studies