VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

Jiatong Shi; George Saon; David Haws; Shinji Watanabe; Brian Kingsbury

arXiv:2208.01818 · cs.SD · August 4, 2022

Open Access

TL;DR

This paper introduces VQ-T, an RNN transducer model using vector-quantized prediction networks that enable hypothesis merging, resulting in improved speech recognition accuracy and more efficient lattice generation.

Contribution

The paper proposes a novel VQ-LSTM prediction network for RNN transducers, allowing hypothesis merging and improved lattice density and accuracy.

Findings

01 Improved WER over standard transducers on Switchboard

02 Denser lattices with low oracle WER at the same beam size

03 Effective lattice generation for rescoring

Abstract

Beam search, the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. Recent studies, however, have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. The full context carried by recurrent networks is not compatible with hypothesis merging, because two hypotheses with different histories almost never share the same continuous state. We propose to use vector-quantized long short-term memory units (VQ-LSTM) in the prediction network of RNN transducers. By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation. Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks while also producing denser lattices with a very low oracle word error rate (WER) for the same beam size. Additional language model…
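The merging idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy codebook, the nearest-neighbor `quantize` helper, the `merge_hypotheses` function, and the choice to sum the probabilities of merged hypotheses are all assumptions made for illustration. The sketch also assumes all hypotheses passed in sit at the same decoding step. The point it shows is that once prediction-network states are mapped to discrete codes, hypotheses with different token histories can be grouped by code and merged.

```python
import math

def quantize(state, codebook):
    """Return the index of the codebook vector nearest to `state`
    (squared Euclidean distance). Hypothetical helper."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(state, codebook[i]))

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def merge_hypotheses(hyps, codebook):
    """Group hypotheses whose states quantize to the same code.

    hyps: list of (tokens, state, log_prob) tuples.
    Returns a dict keyed by code index. Each entry keeps the summed
    log-probability (an assumed merging rule) and the best single path.
    """
    merged = {}
    for tokens, state, log_prob in hyps:
        key = quantize(state, codebook)
        if key not in merged:
            merged[key] = {
                "best_tokens": tokens,
                "best_log_prob": log_prob,
                "log_prob": log_prob,
            }
        else:
            entry = merged[key]
            entry["log_prob"] = logaddexp(entry["log_prob"], log_prob)
            if log_prob > entry["best_log_prob"]:
                entry["best_tokens"] = tokens
                entry["best_log_prob"] = log_prob
    return merged
```

With continuous states, the grouping key would almost never collide; with a finite codebook, collisions are common, which is what allows the search to produce a lattice instead of a tree.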

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling