A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Nidhi Kowtal, Tejas Deshpande, Raviraj Joshi

TL;DR
This paper introduces a data filtering method using cross-lingual sentence representations to improve low-resource English-Marathi machine translation by removing noisy data and enhancing translation quality.
Contribution
It presents a novel filtering approach leveraging multilingual SBERT models to improve data quality in low-resource machine translation tasks.
Findings
Significant translation quality improvement after filtering
Effective removal of problematic translations using IndicSBERT
Framework applicable to other low-resource language pairs
Abstract
Machine translation for low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the English-Marathi language pair, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline…
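The filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the (source, target) pair layout, and the 0.7 cutoff are assumptions made for the example, and the abstract does not state the threshold actually used. The `embed` argument stands in for a cross-lingual sentence encoder such as an IndicSBERT model; any callable that maps a sentence to a vector will work.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.7,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, target) pairs whose embeddings are similar enough.

    `embed` is assumed to map sentences from both languages into one
    shared vector space, as a multilingual SBERT model would.
    """
    kept = []
    for source, target in pairs:
        score = cosine_similarity(embed(source), embed(target))
        if score >= threshold:
            kept.append((source, target, score))
    return kept
```

In practice `embed` would wrap the encoder's sentence-embedding call, and the threshold would be tuned on a held-out sample: a higher value discards more noise but also shrinks an already small training set.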
Taxonomy
Methods: Attention Is All You Need · Softmax · Dropout · Layer Normalization · Linear Layer · Adam · Weight Decay · Dense Connections · WordPiece
