An approach for mistranslation removal from popular dataset for Indic MT Task

Sudhansu Bala Das; Leo Raphael Rodrigues; Tapas Kumar Mishra; Bidyut Kr. Patra

arXiv:2401.06398·cs.CL·January 15, 2024·1 cites

Open Access

TL;DR

This paper presents an algorithm to remove mistranslations from the Samanantar dataset for Indian languages, improving the quality of neural machine translation systems for Hindi and Odia.

Contribution

The paper introduces a novel algorithm for filtering mistranslations in large parallel datasets, enhancing translation quality for Indic language machine translation.
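
This summary does not describe the algorithm itself. As a rough illustration of what filtering a parallel corpus can look like, the sketch below applies a word-count length-ratio check, a generic cleaning heuristic that is swapped in here purely for illustration and is not the paper's method. The names `length_ratio_ok` and `filter_pairs` and the `max_ratio` threshold of 2.5 are hypothetical choices.

```python
from typing import Iterable, List, Tuple

# A sentence pair: (source sentence, target sentence).
Pair = Tuple[str, str]


def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.5) -> bool:
    """Return True if the word counts of the two sides are roughly comparable.

    Illustrative heuristic only (an assumption, not the paper's algorithm):
    pairs whose lengths differ by more than `max_ratio` are treated as
    likely misaligned or mistranslated.
    """
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    # An empty side cannot be a valid translation of a non-empty one.
    if src_len == 0 or tgt_len == 0:
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 2.5) -> List[Pair]:
    """Keep only the pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]


if __name__ == "__main__":
    # Placeholder tokens stand in for real sentences.
    corpus = [
        ("a b c", "x y z"),        # similar lengths: kept
        ("a", "x y z w v u"),      # ratio 6.0: dropped
        ("", "x"),                 # empty source: dropped
    ]
    for src, tgt in filter_pairs(corpus):
        print(f"{src}\t{tgt}")
```

A length check like this only catches gross misalignment; detecting a fluent but incorrect translation would need a semantic signal such as cross-lingual similarity scoring.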

Findings

01

Removing mistranslations improves translation quality metrics.

02

Systems translating from Indian languages (ILs) into English outperform English-to-IL systems.

03

Dataset cleaning leads to better NMT performance.

Abstract

The conversion of content from one language to another utilizing a computer system is known as Machine Translation (MT). Various techniques have come up to ensure effective translations that retain the contextual and lexical interpretation of the source language. End-to-end Neural Machine Translation (NMT) is a popular technique and it is now widely used in real-world MT systems. Massive amounts of parallel datasets (sentences in one language alongside translations in another) are required for MT systems. These datasets are crucial for an MT system to learn linguistic structures and patterns of both languages during the training phase. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since the corpus has been gathered from various sources, it contains many incorrect translations. Hence, the MT systems built using this dataset…


Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques