An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Huahuan Zheng; Keyu An; Zhijian Ou; Chen Huang; Ke Ding; Guanglu Wan

arXiv:2203.16776 · eess.AS · August 4, 2022

TL;DR

This paper investigates methods for integrating external language models into RNN-Transducer speech recognition and proposes a low-order density ratio approach that consistently outperforms shallow fusion across datasets.

Contribution

The paper introduces a low-order density ratio method (LODR) for language model integration in RNN-T speech recognition. LODR outperforms shallow fusion, performs close to ILME, and is better than DR in most tests.

Findings

01. LODR consistently outperforms shallow fusion across datasets.

02. LODR performs close to ILME and better than DR in most tests.

03. Extensive experiments validate the effectiveness of LODR in various scenarios.

Abstract

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model.…
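
The abstract describes the integration methods only in words, so the following is a minimal sketch of how they are commonly written as log-linear score combinations, not the authors' implementation. The function names, the weight parameters `elm_weight` and `ilm_weight`, and the toy numbers are all hypothetical. Under this reading, DR and LODR share the same formula and differ only in which model supplies the ILM estimate: a well-trained neural language model with full context for DR, a low-order weak language model for LODR.

```python
from typing import List, Tuple


def shallow_fusion_score(log_p_rnnt: float, log_p_elm: float,
                         elm_weight: float) -> float:
    """SF: add the weighted ELM log-probability to the RNN-T log-posterior."""
    return log_p_rnnt + elm_weight * log_p_elm


def density_ratio_score(log_p_rnnt: float, log_p_elm: float,
                        log_p_ilm_estimate: float,
                        elm_weight: float, ilm_weight: float) -> float:
    """DR-style: subtract a weighted ILM estimate, then add the weighted ELM.

    For LODR, log_p_ilm_estimate would come from a low-order language model
    instead of a full-context neural one (assumption: same formula otherwise).
    """
    return log_p_rnnt - ilm_weight * log_p_ilm_estimate + elm_weight * log_p_elm


# Each hypothesis: (text, log_p_rnnt, log_p_elm, log_p_ilm_estimate).
Hypothesis = Tuple[str, float, float, float]


def pick_best(hypotheses: List[Hypothesis],
              elm_weight: float, ilm_weight: float) -> str:
    """Return the text of the hypothesis with the highest combined score."""
    best = max(
        hypotheses,
        key=lambda h: density_ratio_score(h[1], h[2], h[3],
                                          elm_weight, ilm_weight),
    )
    return best[0]
```

Setting `ilm_weight` to zero reduces `density_ratio_score` to `shallow_fusion_score`, which makes the relationship between the methods explicit. In practice the weights would be tuned on held-out data.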

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing