End-to-End Speech Recognition with Pre-trained Masked Language Model

Yosuke Higuchi; Tetsuji Ogawa; Tetsunori Kobayashi; Shinji Watanabe

arXiv:2410.00528 · eess.AS · October 2, 2024

TL;DR

This paper introduces BERT-CTC and BECTRA, two end-to-end speech recognition models that integrate a pre-trained BERT language model so that transcription can draw on linguistic context as well as audio.

Contribution

The paper presents methods for embedding pre-trained BERT models into end-to-end ASR systems. BERT-CTC addresses the conditional independence assumption that CTC places on output tokens, and BECTRA addresses the vocabulary mismatch between BERT and ASR training.

Findings

01 Improved accuracy over baseline CTC and transducer models
02 Effective integration of BERT enhances linguistic understanding in ASR
03 Architectural designs are validated through comprehensive experiments

Abstract

We present a novel approach to end-to-end automatic speech recognition (ASR) that utilizes pre-trained masked language models (LMs) to facilitate the extraction of linguistic information. The proposed models, BERT-CTC and BECTRA, are specifically designed to effectively integrate pre-trained LMs (e.g., BERT) into end-to-end ASR models. BERT-CTC adapts BERT for connectionist temporal classification (CTC) by addressing the constraint of the conditional independence assumption between output tokens. This enables explicit conditioning of BERT's contextualized embeddings in the ASR process, seamlessly merging audio and linguistic information through an iterative refinement algorithm. BECTRA extends BERT-CTC to the transducer framework and trains the decoder network using a vocabulary suitable for ASR training. This aims to bridge the gap between the text processed in end-to-end ASR and BERT,…
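The conditional independence assumption mentioned in the abstract can be seen in plain CTC greedy decoding, where each frame's token is chosen without reference to the tokens chosen for other frames. The sketch below illustrates only that baseline behaviour and is not the BERT-CTC or BECTRA algorithm; the blank symbol `_`, the function name `ctc_greedy_decode`, and the toy posteriors are assumptions made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this illustration


def ctc_greedy_decode(frame_posteriors, blank=BLANK):
    """Greedy CTC decoding over a list of per-frame {token: probability} dicts."""
    # Each frame is decided on its own: the choice at one frame never
    # looks at the token picked for any other frame.
    best_path = [max(p, key=p.get) for p in frame_posteriors]
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    collapsed = [tok for tok, _ in groupby(best_path)]
    return [tok for tok in collapsed if tok != blank]


# Toy posteriors for five frames over the tokens "a", "b", and the blank.
frames = [
    {"a": 0.6, "b": 0.1, "_": 0.3},
    {"a": 0.5, "b": 0.1, "_": 0.4},
    {"a": 0.1, "b": 0.2, "_": 0.7},
    {"a": 0.2, "b": 0.7, "_": 0.1},
    {"a": 0.7, "b": 0.2, "_": 0.1},
]
print(ctc_greedy_decode(frames))  # ['a', 'b', 'a']
```

Because every frame is scored in isolation, nothing in this decoder can use the surrounding words to resolve an ambiguous frame. According to the abstract, BERT-CTC targets that limitation by conditioning on BERT's contextualized embeddings.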

Code & Models

Repositories

yosukehiguchi/espnet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · WordPiece · Residual Connection · Adam · Attention Dropout