indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages
Anirudh Gupta, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, Priyanshi Shah, Harveen Singh Chadha, Vivek Raghavan

TL;DR
This paper introduces indic-punct, a framework for automatic punctuation restoration and inverse text normalization in 11 Indic languages. It uses a pretrained IndicBERT model for punctuation and hand-written WFST grammars for inverse text normalization of ASR output.
Contribution
It presents a novel approach combining pretrained IndicBERT and WFST grammars for punctuation and normalization in multiple Indic languages.
Findings
Achieved effective punctuation restoration across 11 languages.
Motivated by the benefit that punctuation and sentence boundaries bring to downstream NLP tasks such as sentiment analysis and machine translation.
Provided publicly available code and data.
Abstract
Automatic Speech Recognition (ASR) generates text that is usually devoid of punctuation. The absence of punctuation in text can affect readability. Downstream NLP tasks such as sentiment analysis and machine translation also benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done by hand-writing weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
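The two steps named in the abstract can be illustrated with a minimal sketch. It is not the indic-punct implementation: the label names, both function names, and the English example are assumptions made for illustration. The first function shows how per-token labels, such as a token classifier might predict, are turned back into punctuated text. The second uses a plain dictionary lookup as a stand-in for WFST grammars, so it has no weights and handles only numbers below 100.

```python
# Hypothetical label set: "O" means no punctuation follows the token.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_punctuation(tokens, labels):
    """Append the mark named by each label to its token and join the tokens."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + LABEL_TO_MARK[lab] for tok, lab in zip(tokens, labels))


# Toy inverse text normalization tables (assumed; English only, numbers below 100).
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def inverse_normalize(text):
    """Rewrite spoken-form numbers (e.g. 'twenty five') as digits."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # Absorb a following non-zero unit word: "twenty" + "five" -> 25.
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in UNITS and UNITS[nxt] > 0:
                value += UNITS[nxt]
                i += 1
            out.append(str(value))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            out.append(word)
        i += 1
    return " ".join(out)


print(apply_punctuation(["how", "are", "you"], ["O", "O", "QUESTION"]))
print(inverse_normalize("i have twenty five rupees"))
```

The dictionary lookup rewrites every number word it sees, including cases such as "one of them" where the spoken form should be kept. Resolving that kind of ambiguity is one reason to prefer a grammar-based approach over a flat lookup.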
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
