Tuned and GPU-accelerated parallel data mining from comparable corpora

Krzysztof Wołk; Krzysztof Marasek

arXiv:1509.08639 · cs.CL · September 30, 2015 · 1 citation

TL;DR

This paper enhances the Yalign data mining method for parallel corpora by reimplementing its comparison algorithm, adding tuning scripts, and leveraging GPU acceleration to improve performance across diverse text domains.

Contribution

It introduces a GPU-accelerated, tuned reimplementation of Yalign, enabling more efficient parallel data mining from Wikipedia and other sources.

Findings

1. Improved mining speed with GPU acceleration
2. Effective extraction of bi-data from Wikipedia
3. Enhanced performance across multiple text domains

Abstract

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to the Yalign mining methodology: reimplementing the comparison algorithm, introducing tuning scripts, and improving performance using GPU computing acceleration. The experiments are conducted on various text domains, and bi-data is extracted from the Wikipedia dumps.
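
To make the idea of mining sentence pairs from comparable text concrete, the following is a minimal Python sketch. It is not Yalign's API and not the authors' comparison algorithm: the function names `similarity` and `mine_pairs`, the toy lexicon-overlap score, and the `threshold` parameter are all assumptions chosen for illustration. The threshold stands in for the kind of value a tuning script might adjust per text domain.

```python
from typing import Dict, List, Tuple


def similarity(src: str, tgt: str, lexicon: Dict[str, str]) -> float:
    """Hypothetical score: fraction of source words whose lexicon
    translation occurs in the target sentence."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w) in tgt_words)
    return hits / len(src_words)


def mine_pairs(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, str],
    threshold: float,
) -> List[Tuple[str, str]]:
    """For each source sentence keep its best-scoring target sentence,
    provided the score reaches the threshold (an assumed tuning knob)."""
    pairs = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: similarity(s, t, lexicon), default=None)
        if best is not None and similarity(s, best, lexicon) >= threshold:
            pairs.append((s, best))
    return pairs


# Toy Polish-English lexicon and two "comparable" sentence lists.
lexicon = {"kot": "cat", "pies": "dog", "je": "eats"}
polish = ["kot je", "pies biega"]
english = ["the dog barks loudly", "cat eats"]

strict = mine_pairs(polish, english, lexicon, threshold=0.6)
loose = mine_pairs(polish, english, lexicon, threshold=0.5)
```

Raising the threshold trades recall for precision: the strict setting keeps only the fully matched pair, while the looser one also accepts a pair with half of its words matched. The all-pairs scoring loop is also the part of such a pipeline that is independent per pair, which is why it is a natural candidate for parallel hardware.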

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression