Capitalization and Punctuation Restoration: a Survey
Vasile Păiș, Dan Tufiș

TL;DR
This survey reviews historical and modern techniques for restoring punctuation and casing in text, a step that matters for NLP tasks whose input lacks reliable punctuation or casing, such as speech recognition output and social media text.
Contribution
It provides a comprehensive overview of methods, challenges, and future research directions in punctuation and casing restoration for text where these are missing or unreliable.
Findings
Historical and recent techniques are compared and analyzed.
Current challenges in punctuation and casing restoration are identified.
Future research directions are proposed.
Abstract
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges and research directions are highlighted.
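To make the task concrete, here is a minimal sketch of one common way to frame restoration: as per-word labeling, where each lowercased word carries a casing label and a label for the punctuation mark that follows it. This framing, the label names (CAP, UPPER, LOWER, O), and the helper functions to_labels and restore are illustrative assumptions, not something the abstract specifies. The sketch only builds and applies labels; a real system would predict them with a model.

```python
# Illustrative only: label set and function names are hypothetical.
PUNCT = {",", ".", "?", "!"}


def to_labels(text):
    """Turn well-formed text into (lowercased word, casing, punctuation) triples."""
    triples = []
    for token in text.split():
        # A trailing punctuation mark becomes the label; "O" means none.
        punct = token[-1] if token[-1] in PUNCT else "O"
        word = token.rstrip("".join(PUNCT))
        if not word:
            continue  # token was punctuation only
        if word.isupper() and len(word) > 1:
            case = "UPPER"
        elif word[0].isupper():
            case = "CAP"
        else:
            case = "LOWER"  # lossy for mixed-case words such as "iPhone"
        triples.append((word.lower(), case, punct))
    return triples


def restore(triples):
    """Rebuild text from the triples (what a model's predictions would feed)."""
    out = []
    for word, case, punct in triples:
        if case == "UPPER":
            word = word.upper()
        elif case == "CAP":
            word = word[0].upper() + word[1:]
        out.append(word + ("" if punct == "O" else punct))
    return " ".join(out)


sentence = "Hello, NASA launched it. Did you see?"
labels = to_labels(sentence)
raw_input = " ".join(word for word, _, _ in labels)  # simulated ASR-style text
print(raw_input)
print(restore(labels))
```

The lowercased, unpunctuated string stands in for the kind of input the abstract describes (raw speech recognition output), and the triples are the targets a restoration model would be trained to predict.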
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
