An approach for mistranslation removal from popular dataset for Indic MT Task
Sudhansu Bala Das, Leo Raphael Rodrigues, Tapas Kumar Mishra, Bidyut Kr. Patra

TL;DR
This paper presents an algorithm to remove mistranslations from the Samanantar dataset for Indian languages, improving the quality of neural machine translation systems for Hindi and Odia.
Contribution
The paper introduces a novel algorithm for filtering mistranslations in large parallel datasets, enhancing translation quality for Indic language machine translation.
Findings
Removing mistranslations improves translation quality metrics.
Systems translating from Indian languages (ILs) into English outperform English-to-IL systems.
Dataset cleaning leads to better NMT performance.
Abstract
The conversion of content from one language to another utilizing a computer system is known as Machine Translation (MT). Various techniques have come up to ensure effective translations that retain the contextual and lexical interpretation of the source language. End-to-end Neural Machine Translation (NMT) is a popular technique and it is now widely used in real-world MT systems. Massive amounts of parallel datasets (sentences in one language alongside translations in another) are required for MT systems. These datasets are crucial for an MT system to learn linguistic structures and patterns of both languages during the training phase. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since the corpus has been gathered from various sources, it contains many incorrect translations. Hence, the MT systems built using this dataset…
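To make the notion of a parallel dataset and of filtering it concrete, here is a minimal sketch. It is not the paper's algorithm, which this summary does not describe. It uses a generic token-length-ratio heuristic purely as an illustration, and the names `SentencePair`, `length_ratio`, `filter_pairs`, and the threshold `max_ratio=2.0` are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SentencePair:
    """One entry of a parallel dataset: a sentence and its translation."""
    source: str  # e.g. an English sentence
    target: str  # e.g. its Hindi or Odia translation


def length_ratio(pair: SentencePair) -> float:
    """Ratio of the longer side to the shorter side, in whitespace tokens."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    if src_len == 0 or tgt_len == 0:
        # An empty side can never be a valid translation.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def filter_pairs(pairs: List[SentencePair], max_ratio: float = 2.0) -> List[SentencePair]:
    """Keep pairs whose lengths are plausibly compatible (hypothetical threshold)."""
    return [p for p in pairs if length_ratio(p) <= max_ratio]


pairs = [
    SentencePair("I am going home", "मैं घर जा रहा हूँ"),
    # Suspicious: a one-word source paired with a long, unrelated target.
    SentencePair("Hello", "यह एक बहुत लंबा और असंबंधित वाक्य है"),
]
clean = filter_pairs(pairs)
```

A length-ratio check only catches gross mismatches. Detecting mistranslations of similar length would need a semantic signal, which is outside the scope of this sketch.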
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques
