xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Mingda Chen; Kevin Heffernan; Onur Çelebi; Alex Mourachko; Holger Schwenk

arXiv:2306.12907 · cs.CL · June 23, 2023 · 1 citation

TL;DR

This paper introduces xSIM++, an enhanced proxy score for evaluating bitext mining in low-resource languages, which better correlates with translation quality and offers detailed error analysis, reducing the need for costly mining procedures.

Contribution

xSIM++ extends previous proxy methods by incorporating rule-based synthetic examples, improving correlation with downstream translation performance in low-resource scenarios.

Findings

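01 xSIM++ correlates more strongly than xSIM with the downstream BLEU scores of translation systems trained on mined bitexts.

02 xSIM++ reports performance by error type, giving finer-grained feedback for model development.

03 The proxy is validated through a large number of bitext mining experiments on low-resource languages, followed by NMT training on the mined data.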

Abstract

We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained…
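
The idea in the abstract can be illustrated with a minimal sketch. It assumes the proxy is an error rate: the share of source sentences whose nearest English neighbour in the embedding space is not the gold translation. Plain cosine similarity is swapped in here for simplicity; the scoring used by the authors may differ. The vectors, the `error_rate` helper, and the `hard_negatives` list are all hypothetical and made up for illustration; they are not real encoder output or real results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def error_rate(src_vecs, tgt_vecs, gold):
    """Percentage of source sentences whose nearest target is not the gold one.

    gold[i] is the index in tgt_vecs of the correct translation of src_vecs[i].
    """
    errors = 0
    for i, s in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        if best != gold[i]:
            errors += 1
    return 100.0 * errors / len(src_vecs)


# Toy 2-d "embeddings" (assumption: invented values, not model output).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eng = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]]
gold = [0, 1, 2]

# Hypothetical synthetic hard negatives: English variants that sit very
# close to a gold sentence, appended to the English candidate pool.
hard_negatives = [[1.0, 0.02], [1.0, 1.0]]

base = error_rate(src, eng, gold)
augmented = error_rate(src, eng + hard_negatives, gold)
print(f"without hard negatives: {base:.1f}%")
print(f"with hard negatives:    {augmented:.1f}%")
```

In this toy setup the encoder looks perfect on the original pool, and the error rate rises once near-duplicate candidates are added. That is the kind of gap the abstract says a harder evaluation set is meant to expose.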

Code & Models

Repositories

facebookresearch/LASER (PyTorch, official)

Datasets

jaygala24/xsimplusplus (dataset, 461 downloads)

Taxonomy

Topics: Software Engineering Research · Imbalanced Data Classification Techniques · Text and Document Classification Technologies