xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger Schwenk

TL;DR
This paper introduces xSIM++, an improved proxy score for evaluating bitext mining in low-resource languages. It correlates better than xSIM with downstream translation quality and breaks results down by error type, so mining performance can be estimated without running a costly mining pipeline.
Contribution
xSIM++ extends previous proxy methods by incorporating rule-based synthetic examples, improving correlation with downstream translation performance in low-resource scenarios.
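As a rough illustration of what a rule-based synthetic example could look like, the sketch below perturbs the numbers in an English sentence so that the result stays close to the original on the surface but no longer means the same thing. The number-shifting rule and the name `replace_numbers` are assumptions made for illustration; the summary above does not specify which rules xSIM++ applies.

```python
import re


def replace_numbers(sentence, offset=1):
    """Return a copy of `sentence` with every integer shifted by `offset`.

    Hypothetical rule: it only illustrates the idea of a synthetic,
    hard-to-distinguish example, not the rules used by the paper.
    """
    return re.sub(r"\d+", lambda m: str(int(m.group()) + offset), sentence)


original = "The train leaves at 9 with 120 passengers."
negative = replace_numbers(original)
print(negative)
```

A sentence produced this way would be added to the English side of the evaluation set as a distractor, which a good encoder should rank below the true translation.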
Findings
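xSIM++ correlates more strongly than xSIM with the BLEU scores of translation systems trained on mined bitexts.
It reports performance per error type, giving finer-grained feedback for model development.
The proxy is validated through extensive bitext mining experiments on low-resource languages, followed by NMT training on the mined data.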
Abstract
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained…
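The idea of a similarity-based proxy score can be sketched as a nearest-neighbour error rate: each source sentence embedding is matched against all candidate English embeddings, and a mismatch with the gold translation counts as an error. The sketch below is a minimal illustration under stated assumptions: it uses plain cosine similarity and toy two-dimensional vectors, and the name `proxy_error_rate` is hypothetical. The abstract only says the score is based on similarity in a multilingual embedding space, so the actual scoring function may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_error_rate(src_vecs, tgt_vecs, gold):
    """Fraction of sources whose most similar target is not the gold one.

    gold[i] is the index in tgt_vecs of the true translation of source i.
    Plain cosine similarity is an assumption made for this sketch.
    """
    errors = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        if best != gold[i]:
            errors += 1
    return errors / len(src_vecs)


# Toy embeddings (made up for illustration, not real encoder output).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 1]

# A synthetic distractor that lies very close to the first source sentence.
hard_negative = [1.0, 0.02]

baseline = proxy_error_rate(src, tgt, gold)
with_negatives = proxy_error_rate(src, tgt + [hard_negative], gold)
print(baseline, with_negatives)
```

In this toy setup the original candidate pool is retrieved without errors, while adding a single hard distractor makes the first source sentence pick the wrong candidate. That is the intuition behind extending the English side with hard-to-distinguish examples: the enlarged pool exposes weaknesses that an easy candidate pool hides.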
Taxonomy
Topics: Software Engineering Research · Imbalanced Data Classification Techniques · Text and Document Classification Technologies
