indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages

Anirudh Gupta; Neeraj Chhimwal; Ankur Dhuriya; Rishabh Gaur; Priyanshi Shah; Harveen Singh Chadha; Vivek Raghavan

arXiv:2203.16825·cs.CL·April 1, 2022·1 cites

TL;DR

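This paper introduces indic-punct, a framework for automatic punctuation restoration and inverse text normalization in 11 Indic languages. It uses a pretrained IndicBERT model to restore punctuation and hand-written WFST grammars for inverse text normalization, with the aim of making ASR output more readable and more useful to downstream NLP tasks.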

Contribution

It combines a pretrained IndicBERT model for punctuation restoration with hand-written weighted finite state transducer (WFST) grammars for inverse text normalization, covering 11 Indic languages.

Findings

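01 Built punctuation restoration and inverse text normalization for 11 Indic languages.

02 Noted that downstream NLP tasks such as sentiment analysis and machine translation benefit from punctuation and sentence boundary information.

03 Provided publicly available code and data.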

Abstract

Automatic Speech Recognition (ASR) generates text that is, most of the time, devoid of punctuation. The absence of punctuation in text can affect readability. Also, downstream NLP tasks such as sentiment analysis and machine translation benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done by hand-writing weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
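The two steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the indic-punct implementation: the label set, the function names `apply_punctuation` and `inverse_normalize`, and the English example words are all assumptions made for illustration. The first function shows how per-token labels, of the kind a token-classification model such as IndicBERT might predict, turn back into punctuated text. The second is a hand-written rule loop standing in for a WFST grammar, which in practice would be compiled with a dedicated toolkit and written separately for each language.

```python
# Hypothetical label set; the labels used by the actual model may differ.
PUNCT_FOR_LABEL = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_punctuation(tokens, labels):
    """Append the punctuation mark predicted for each token to that token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + PUNCT_FOR_LABEL[lab] for tok, lab in zip(tokens, labels))


# Toy number vocabulary (English, for illustration only).
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def inverse_normalize(tokens):
    """Rewrite spoken-form number phrases (below one thousand) as digits.

    A simplified stand-in for a WFST grammar: it scans left to right and
    accumulates the value of a number phrase until a non-number word ends it.
    """
    out = []
    value = None  # value of the number phrase being read, or None if not in one
    for tok in tokens:
        word = tok.lower()
        if word in SMALL:
            value = (value or 0) + SMALL[word]
        elif word == "hundred" and value is not None:
            value *= 100
        else:
            if value is not None:
                out.append(str(value))
                value = None
            out.append(tok)
    if value is not None:
        out.append(str(value))
    return out


if __name__ == "__main__":
    # Labels are written by hand here; a trained model would predict them.
    print(apply_punctuation(["hello", "how", "are", "you"],
                            ["COMMA", "O", "O", "QUESTION"]))
    print(" ".join(inverse_normalize("i paid two hundred thirty five rupees".split())))
```

The rule loop deliberately ignores cases a real grammar has to handle, such as digit sequences read one by one, ordinals, dates, and currency formats. Weighted transducers are a natural fit for those cases because competing rewrites can be ranked by weight.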

Code & Models

Repositories

open-speech-ekstep/indic-punct (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems