Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya

TL;DR
This paper introduces Viram, a benchmark for testing punctuation robustness in English-Marathi NMT, and evaluates strategies to improve translation quality when punctuation is missing or incorrect.
Contribution
The work presents a new benchmark dataset and compares two remediation strategies, showing that they handle punctuation errors better than existing models.
Findings
Both remediation strategies improve NMT performance significantly.
Current LLMs are less robust than task-specific strategies for punctuation errors.
The Viram benchmark exposes weaknesses in the punctuation robustness of existing NMT systems.
Abstract
Neural Machine Translation (NMT) systems rely heavily on explicit punctuation cues to resolve semantic ambiguities in a source sentence. Inputting user-generated sentences, which are likely to contain missing or incorrect punctuation, results in fluent but semantically disastrous translations. This work attempts to highlight and address the problem of punctuation robustness of NMT systems through English-to-Marathi translation. First, we introduce Viram, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs to stress-test existing NMT systems. Second, we evaluate two simple remediation strategies: cascade-based restore-then-translate and direct fine-tuning. Our experimental results and analysis demonstrate that both strategies yield substantial NMT performance improvements. Furthermore, we find that current…
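The restore-then-translate cascade named in the abstract can be pictured as a two-stage composition: a punctuation restorer runs first, and its output is passed to the translator. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names `restore_then_translate`, `toy_restore`, and `toy_translate` are hypothetical, and the two toy stages are placeholders standing in for a real punctuation-restoration model and a real English-to-Marathi NMT model.

```python
from typing import Callable

# Type aliases for the two stages of the cascade.
Restorer = Callable[[str], str]
Translator = Callable[[str], str]


def restore_then_translate(source: str, restore: Restorer, translate: Translator) -> str:
    """Cascade: restore punctuation first, then translate the restored text."""
    restored = restore(source)
    return translate(restored)


def toy_restore(text: str) -> str:
    """Hypothetical stand-in for a punctuation-restoration model.

    Uses a tiny lookup table; unknown inputs pass through unchanged.
    """
    fixes = {"lets eat grandma": "Let's eat, grandma."}
    return fixes.get(text.lower().strip(), text)


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for an NMT model; it only tags the input."""
    return f"<mr>{text}</mr>"


if __name__ == "__main__":
    print(restore_then_translate("lets eat grandma", toy_restore, toy_translate))
```

Keeping the two stages as interchangeable callables means either one can be swapped out independently. Direct fine-tuning, by contrast, would replace the whole cascade with a single translator trained to tolerate unpunctuated input.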
Code & Models
- 🤗 thenlpresearcher/iitb-en-indic-without-punct (model, 4 downloads)
- 🤗 thenlpresearcher/iitb-en-indic-only-punct (model, 1 download)
- 🤗 thenlpresearcher/shalaka_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva (model)
- 🤗 thenlpresearcher/shalaka_fd_indictrans2-en-indic-dist-200M_finetuned_eng_Latn_to_mar_Deva (model, 5 downloads)
- 🤗 thenlpresearcher/iitb-t5-finetuned-punctuation (model, 4 downloads)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
