Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Hao Zhang; You-Chi Cheng; Shankar Kumar; W. Ronny Huang; Mingqing Chen; Rajiv Mathews

arXiv:2202.08171 · cs.CL · February 17, 2022

TL;DR

This paper introduces a fast, accurate, and compact hierarchical RNN for capitalization normalization (truecasing). Language models trained on text normalized by this truecaser perform better in a virtual keyboard application and in speech recognition.

Contribution

The paper presents a two-level hierarchical word-and-character-based RNN for truecasing that is fast, accurate, and compact, and uses it to normalize user-generated text for language model training in a Federated Learning framework.
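To make the two-level idea concrete, here is a minimal, untrained sketch in plain Python. It is not the authors' implementation: the hidden size `H`, the Elman-style `rnn_step`, the hash-like `embed` stand-in for learned embeddings, the single left-to-right direction, and the choice to start the character-level RNN from the word-level state are all assumptions made for illustration. The sketch only shows the shape of the computation: one recurrent pass over words, a second recurrent pass over the characters of each word, and one uppercase probability per character.

```python
import math
import random

random.seed(0)
H = 4  # hidden size, chosen arbitrarily for illustration


def rand_matrix(rows, cols):
    """Random weights; a real model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_rec, x, h):
    """One Elman RNN step: h' = tanh(W_in x + W_rec h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]


def embed(token, dim):
    """Hypothetical stand-in for a learned embedding: a fixed vector per token."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Untrained parameters for the word level, the character level, and the output.
W_word_in, W_word_rec = rand_matrix(H, H), rand_matrix(H, H)
W_char_in, W_char_rec = rand_matrix(H, H), rand_matrix(H, H)
w_out = [random.uniform(-0.5, 0.5) for _ in range(H)]


def truecase_probs(words):
    """Return, for each word, a list with P(uppercase) for each character."""
    h_word = [0.0] * H
    result = []
    for word in words:
        lowered = word.lower()
        # Level 1: the word-level RNN reads the lowercased word.
        h_word = rnn_step(W_word_in, W_word_rec, embed(lowered, H), h_word)
        # Level 2: the character-level RNN starts from the word-level state
        # (an assumption about how the two levels are connected).
        h_char = h_word
        probs = []
        for ch in lowered:
            h_char = rnn_step(W_char_in, W_char_rec, embed(ch, H), h_char)
            logit = sum(w * x for w, x in zip(w_out, h_char))
            probs.append(1.0 / (1.0 + math.exp(-logit)))
        result.append(probs)
    return result


probs = truecase_probs("hello from new york".split())
```

Because the weights are random, the probabilities carry no meaning here. In a trained model, each value would be thresholded to decide whether the corresponding character is uppercased.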

Findings

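01 A case-aware language model trained on truecaser-normalized text achieves the same perplexity as a model trained on text with gold capitalization

02 Reduces prediction error rates in a virtual keyboard application, measured in a real user A/B experiment

03 Lowers uppercase character error rate and word error rate in an ASR language model fusion experiment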

Abstract

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
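The task can be illustrated as per-character labeling. The sketch below assumes a label string with one `U` (uppercase) or `L` (lowercase) per character; the function names `case_labels` and `apply_case_labels` are hypothetical and not taken from the paper. It shows how labels would be derived from correctly cased text and how a predicted label sequence turns lowercased input back into cased text.

```python
def case_labels(text: str) -> str:
    """Derive one label per character from correctly cased text."""
    return "".join("U" if ch.isupper() else "L" for ch in text)


def apply_case_labels(text: str, labels: str) -> str:
    """Restore case: uppercase characters labeled 'U', lowercase the rest."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    return "".join(
        ch.upper() if label == "U" else ch.lower()
        for ch, label in zip(text, labels)
    )


gold = "I met Rajiv at ICML"
noisy = gold.lower()          # what a truecaser would receive as input
labels = case_labels(gold)    # what a truecaser would be trained to predict
restored = apply_case_labels(noisy, labels)
```

The example uses ASCII text. For characters whose uppercase form has a different length, a per-character scheme like this one would need extra handling.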

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques