Multitask Training with Text Data for End-to-End Speech Recognition

Peidong Wang; Tara N. Sainath; Ron J. Weiss

arXiv:2010.14318 · cs.CL · June 15, 2021 · 1 citation

TL;DR

This paper introduces a multitask training approach for end-to-end speech recognition that leverages both audio-text and text-only data, improving accuracy without extra language models.

Contribution

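The paper presents a multitask training method for attention-based end-to-end speech recognition: the decoder of a listen, attend, and spell model is regularized by training it on both audio-text and text-only data, so that language information is learned without a separate language model.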

Findings

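01. 11% relative performance improvement over the baseline when trained on the LibriSpeech 100-hour subset, without an additional language model

02. Approaches the performance of language model shallow fusion on the test-clean evaluation set

03. Incorporates language-level information, as shown by analyses of error types and sample output sentences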
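To make the "relative improvement" figure concrete, the following minimal sketch shows how such a number is usually computed. It assumes the metric is word error rate (the summary only says "performance"), and the two input values are hypothetical, chosen purely for illustration; they are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical error rates, NOT taken from the paper.
baseline_wer = 10.0
proposed_wer = 8.9

print(f"{relative_improvement(baseline_wer, proposed_wer):.0%}")
```

A relative improvement is measured against the baseline error rate, so an absolute drop of 1.1 points from a 10.0 baseline is an 11% relative improvement.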

Abstract

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, leads to an 11% relative performance improvement over the baseline and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of different types of errors and sample output sentences demonstrate that the proposed method can incorporate language-level information, suggesting its effectiveness in real-world applications.
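The idea of training one decoder on two kinds of data can be sketched as a weighted sum of two cross-entropy losses. This is only an illustration under stated assumptions: the abstract does not say how the two objectives are combined or how text-only examples are fed to the decoder, so the weighted sum, the `text_weight` parameter, and the function names below are all hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(step_probs: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target token at each decoder step."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, targets)) / len(targets)


def multitask_loss(asr_probs, asr_targets, text_probs, text_targets,
                   text_weight: float = 0.5) -> float:
    """Combine the two decoder objectives (assumed form, not from the paper).

    asr_probs:  decoder output distributions for an audio-text example
    text_probs: decoder output distributions for a text-only example
    """
    asr_loss = cross_entropy(asr_probs, asr_targets)
    text_loss = cross_entropy(text_probs, text_targets)
    return asr_loss + text_weight * text_loss


# Toy two-token vocabulary with a single decoding step per example.
loss = multitask_loss([[0.5, 0.5]], [0], [[0.5, 0.5]], [1])
print(round(loss, 4))
```

Because both terms are computed from the same decoder, gradients from the text-only term act as a regularizer on it, which is the role the abstract attributes to the text data.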

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling