Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Muhammad Shakeel; Yui Sudo; Yifan Peng; Shinji Watanabe

arXiv:2406.16120 · eess.AS · September 12, 2024

TL;DR

This paper introduces an intermediate biasing loss for end-to-end speech recognition: an explicit auxiliary loss applied at intermediate encoder layers so that contextual knowledge is used there, improving accuracy, especially on words from the biasing list.

Contribution

Applying an explicit biasing loss at intermediate encoder layers, as an auxiliary task, adds contextualization and regularization to the speech recognition model.

Findings

01

22.5% relative improvement in biased word error rate (B-WER) on LibriSpeech over a conventional contextual biasing baseline

02

Up to 44% relative improvement in B-WER over the non-contextual baseline, with a biasing list size of 100

03

Further WER reduction with RNN-transducer-driven joint decoding
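
The relative figures above follow the usual definition of relative error-rate reduction. A minimal sketch, assuming that definition; the function name and the sample values are hypothetical and are not results from the paper:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction in error rate, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical B-WER values, chosen only to illustrate the arithmetic.
example = relative_improvement(baseline_wer=10.0, new_wer=7.75)
```

A relative improvement is measured against the baseline's own error rate, so the same absolute drop counts for more when the baseline error is already low.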

Abstract

Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding…
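
One common way to attach an auxiliary loss to intermediate encoder layers is to interpolate it with the final-layer loss. The sketch below assumes that scheme; the function name, the averaging over layers, and the default weight are illustrative assumptions and are not taken from the abstract:

```python
from typing import Sequence


def combine_losses(
    final_loss: float,
    intermediate_losses: Sequence[float],
    aux_weight: float = 0.3,  # hypothetical value, not from the paper
) -> float:
    """Interpolate the final-layer loss with the mean intermediate-layer loss.

    Assumed form: (1 - w) * final + w * mean(intermediate).
    """
    if not 0.0 <= aux_weight <= 1.0:
        raise ValueError("aux_weight must lie in [0, 1]")
    if not intermediate_losses:
        # No auxiliary layers selected: fall back to the final objective.
        return final_loss
    aux = sum(intermediate_losses) / len(intermediate_losses)
    return (1.0 - aux_weight) * final_loss + aux_weight * aux


# Hypothetical per-layer biasing losses from two intermediate encoder layers.
total = combine_losses(final_loss=2.0, intermediate_losses=[1.0, 3.0], aux_weight=0.5)
```

In a real training loop the inputs would be differentiable tensors instead of floats, so gradients from the auxiliary term reach the lower encoder layers directly.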

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: ALIGN