Statistical Machine Translation for Indic Languages
Sudhansu Bala Das, Divyajoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra

TL;DR
This paper develops bilingual Statistical Machine Translation models between English and fifteen low-resource Indian languages. The models are built on the Samanantar and OPUS datasets with preprocessing and reordering, tuned and tested on Flores-200, and evaluated with standard translation quality metrics.
Contribution
It introduces a comprehensive approach for SMT model development for multiple low-resource Indian languages using open-source tools and dataset analysis.
Findings
Effective preprocessing reduces dataset noise.
Reordering improves translation accuracy.
Models achieve competitive BLEU, METEOR, and RIBES scores.
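To make the first of these metrics concrete, here is a minimal, unsmoothed sentence-level BLEU sketch with a single reference. It is an illustration only and not the evaluation setup used in the paper: the function names `ngrams` and `sentence_bleu` are hypothetical, tokenization is plain whitespace splitting, and BLEU is normally reported at corpus level with a standard tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU for one hypothesis against one reference.

    Assumptions for this sketch: whitespace tokenization, uniform
    weights over n-gram orders 1..max_n, no smoothing (any order
    with zero matches yields a score of 0.0).
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the modified n-gram precisions.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))
    print(sentence_bleu("the cat sat on", "the cat sat on the mat"))
```

The second call shows the brevity penalty at work: every n-gram in the shorter hypothesis matches the reference, yet the score drops below 1 because the hypothesis is only four tokens against six.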
Abstract
A Machine Translation (MT) system generally aims at automatically rendering a source language into a target language while retaining the original context, using various Natural Language Processing (NLP) techniques. Among these NLP methods, Statistical Machine Translation (SMT) uses probabilistic and statistical techniques to analyze information and perform the conversion. This paper discusses the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefly described in relation to our experimental needs. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with the standard benchmark dataset (Flores-200) for fine-tuning and testing, is done as part of our experiment. Different preprocessing approaches are proposed in this paper to handle the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
