Tuned and GPU-accelerated parallel data mining from comparable corpora
Krzysztof Wołk, Krzysztof Marasek

TL;DR
This paper enhances the Yalign data mining method for parallel corpora by reimplementing its comparison algorithm, adding tuning scripts, and leveraging GPU acceleration to improve performance across diverse text domains.
Contribution
It introduces a GPU-accelerated, tuned reimplementation of Yalign, enabling more efficient parallel data mining from Wikipedia and other sources.
Findings
Improved mining speed with GPU acceleration
Effective extraction of bi-data from Wikipedia
Enhanced performance across multiple text domains
Abstract
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to the Yalign mining methodology: reimplementing the comparison algorithm, introducing tuning scripts, and improving performance using GPU computing acceleration. The experiments are conducted on various text domains, and bi-data is extracted from Wikipedia dumps.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
