Majority Voting with Bidirectional Pre-translation For Bitext Retrieval
Alex Jones, Derry Tanti Wijaya

TL;DR
This paper introduces an approach to bitext retrieval based on majority voting with bidirectional pre-translation. It addresses problems in current mining methods and demonstrates improvements on the Tatoeba similarity search benchmark and on downstream NMT.
Contribution
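It proposes computationally economical solutions to problems in current bitext mining methods, shows how the amount of monolingual and bilingual data available for a language affects the optimal choice of mining approach, and echoes previously observed problems with the BUCC dataset. The code and data are publicly available.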
Findings
Improved bitext retrieval performance on Tatoeba benchmark
Effect of resource availability on mining approach choice
Identification of problems with BUCC dataset
Abstract
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages. In this paper, we outline some problems with current methods, propose computationally economical solutions to those problems, and demonstrate success with novel methods on the Tatoeba similarity search benchmark and on a downstream task, namely NMT. We uncover the effect of resource-related factors (i.e. how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining approach, and echo problems with the oft-used BUCC dataset that have been observed by others. We make the code and data used for our experiments publicly available.
Taxonomy
Topics
Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
Methods
Linear Layer · Residual Connection · Softmax · Attention Is All You Need · Multi-Head Attention · Layer Normalization · Dense Connections · Byte Pair Encoding · Adam
