Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents
Krzysztof Wołk, Krzysztof Marasek

TL;DR
This paper improves comparable corpora mining for bilingual translation by re-implementing the comparison algorithms, adding a tuning script, and using GPU acceleration, leading to better data quality and translation performance across multiple domains.
Contribution
It introduces improved comparison algorithms, a tuning script, and GPU acceleration for mining comparable corpora, boosting translation data quality and quantity.
Findings
Improved mining algorithms increased data quality.
GPU acceleration reduced processing time.
Enhanced data improved translation accuracy.
Abstract
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to current comparable corpora mining methodologies: re-implementation of the comparison algorithms (using the Needleman-Wunsch algorithm), introduction of a tuning script, and computation time improvement by GPU acceleration. Experiments are carried out on bilingual data extracted from the Wikipedia, on…
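The abstract names the Needleman-Wunsch algorithm as the basis of the re-implemented comparison step. As a rough illustration of what that algorithm computes, here is a minimal sketch of a global alignment score between two token sequences. It is not the authors' implementation: the function name `needleman_wunsch_score` and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for the example, and the summary does not say how the paper applies the score when filtering sentence pairs.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of sequences a and b.

    Hypothetical helper for illustration; the scoring values are
    assumptions, not parameters taken from the paper.
    """
    # prev[j] holds the best score for aligning the processed prefix
    # of a with the first j items of b. Row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i items of a aligned against nothing.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or insert a gap in one of the two sequences.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    left = "the cat sat".split()
    right = "the sat".split()
    print(needleman_wunsch_score(left, right))
```

Only two rows of the dynamic-programming table are kept because this sketch needs the score alone; recovering the alignment itself would require the full table or a traceback structure.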
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
