OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval
Tong Niu, Kazuma Hashimoto, Yingbo Zhou, Caiming Xiong

TL;DR
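OneAligner is a sentence alignment model for retrieval tasks. Trained on all language pairs of OPUS-100, it achieves state-of-the-art accuracy on Tatoeba with a small fraction of the parallel data used by prior work. Finetuned on a single rich-resource language pair, it transfers zero-shot to low-resource pairs and matches the all-pairs model under the same data budget, suggesting that data size matters more than which language pairs are used.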
Contribution
The paper introduces OneAligner, a zero-shot cross-lingual sentence alignment model trained on one language pair that generalizes well to low-resource languages, outperforming previous models with less data.
Findings
Achieves state-of-the-art accuracy on the Tatoeba dataset when trained on all OPUS-100 language pairs.
Matches the performance of the model finetuned on all language pairs when finetuned on only one rich-resource pair.
Performance improves with more rich-resource language pairs, lessening the need for low-resource data.
Abstract
Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by 8.0 points in accuracy while using less than 0.6% of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data…
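To make the "accuracy" reported for sentence retrieval concrete, here is a minimal sketch of how such a score can be computed. It assumes each sentence has already been mapped to a vector and that retrieval picks the target with the highest cosine similarity. The abstract excerpt does not state OneAligner's actual scoring function, so the cosine nearest-neighbor rule, the function names, and the toy vectors below are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose best-scoring target is the gold one.

    Assumption: src_vecs[i] and tgt_vecs[i] embed a gold parallel pair.
    """
    correct = 0
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        # Index of the highest-scoring target sentence for this source.
        best = max(range(len(tgt_vecs)), key=scores.__getitem__)
        correct += int(best == i)
    return correct / len(src_vecs)


# Hypothetical 2-d "embeddings"; a real encoder would produce these.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.8]]

if __name__ == "__main__":
    print(retrieval_accuracy(src, tgt))
```

Under this reading, a gain of 8.0 points means that 8 more source sentences per 100 retrieve their gold translation as the top-ranked candidate.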
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
