Cross-language Sentence Selection via Data Augmentation and Rationale Training
Yanda Chen, Chris Kedzie, Suraj Nair, Petra Galuščáková, Rui Zhang, Douglas W. Oard, Kathleen McKeown

TL;DR
This paper introduces a novel cross-language sentence selection method using data augmentation and rationale training, achieving competitive results in low-resource settings by directly learning cross-lingual relevance models from noisy parallel data.
Contribution
It presents a new approach combining data augmentation, negative sampling, and rationale training to improve cross-language sentence selection in low-resource scenarios.
Findings
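Performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
Adding the rationale training objective yields consistent improvements across English-Somali, English-Swahili, and English-Tagalog.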
Effective in low-resource language pairs.
Abstract
This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.
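One way to picture the data augmentation and negative sampling step is the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: it assumes each English word in a parallel pair can serve as a query for which the foreign-language sentence is relevant, and that words drawn from other sentences serve as negatives. The function name `build_examples`, the `num_negatives` parameter, and the toy sentence pairs are all hypothetical.

```python
import random


def build_examples(parallel_pairs, num_negatives=1, seed=0):
    """Turn (english, foreign) sentence pairs into (query, sentence, label) triples.

    Assumption (not confirmed by the abstract): queries are single English words.
    """
    rng = random.Random(seed)
    # Vocabulary of all English-side words; negatives are drawn from it.
    vocab = sorted({w for en, _ in parallel_pairs for w in en.lower().split()})
    examples = []
    for en, foreign in parallel_pairs:
        en_words = set(en.lower().split())
        # Words that never occur in this sentence's English side.
        candidates = [w for w in vocab if w not in en_words]
        for word in sorted(en_words):
            # Positive: the word occurs in the English side, so the
            # foreign sentence is treated as relevant to it.
            examples.append((word, foreign, 1))
            # Negatives: sampled words absent from the English side.
            k = min(num_negatives, len(candidates))
            for neg in rng.sample(candidates, k):
                examples.append((neg, foreign, 0))
    return examples


if __name__ == "__main__":
    # Toy English-Swahili pairs, made up for illustration only.
    pairs = [("the cat sleeps", "paka analala"), ("a dog runs", "mbwa anakimbia")]
    for query, sentence, label in build_examples(pairs):
        print(query, "|", sentence, "|", label)
```

The resulting triples could then be used to train a relevance classifier over cross-lingual embeddings. The rationale training objective described above would add a second loss term on top of this and is not shown here.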
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
