BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

Yosuke Higuchi; Brian Yan; Siddhant Arora; Tetsuji Ogawa; Tetsunori Kobayashi; Shinji Watanabe

arXiv:2210.16663·eess.AS·April 21, 2023

TL;DR

This paper introduces BERT-CTC, a new end-to-end speech recognition model that integrates BERT's contextual embeddings with CTC, relaxing independence assumptions and enhancing linguistic understanding for improved accuracy and downstream task performance.
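
For context, the independence assumption being relaxed is the one in the standard CTC objective, which can be written as below. The notation is chosen here for illustration and is not taken from the paper: $O$ is the audio input, $W$ the output token sequence, $A = (a_1, \dots, a_T)$ a frame-level alignment, and $\mathcal{B}$ the collapsing function that maps an alignment to a token sequence.

```latex
p(W \mid O) = \sum_{A \in \mathcal{B}^{-1}(W)} \prod_{t=1}^{T} p(a_t \mid O)
```

Each frame label $a_t$ is predicted from the audio alone, without reference to the other output labels. Per the abstract, BERT-CTC instead introduces an explicit output dependency through BERT contextual embeddings.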

Contribution

It proposes a novel formulation combining BERT with CTC for speech recognition, enabling contextual dependencies and improving robustness across languages and speaking styles.

Findings

01 Outperforms conventional CTC-based models in accuracy

02 Enhances robustness across languages and speaking styles

03 Benefits downstream spoken language understanding tasks

Abstract

This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages.…
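
The two inference ingredients named in the abstract can be illustrated separately with a minimal Python sketch. It shows standard CTC collapsing and a generic mask-predict loop; it is not the paper's implementation, and how BERT-CTC couples the two steps is not reproduced here. The names `ctc_collapse`, `mask_predict`, the `predict` callback, and the `<blank>`/`<mask>` symbols are all hypothetical and introduced only for this example.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "<blank>"  # hypothetical blank symbol
MASK = "<mask>"    # hypothetical mask symbol

# A predictor takes a partially masked hypothesis and returns, for every
# position, a (token, confidence) pair. In practice this would be a model;
# here it is an assumed callback.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def ctc_collapse(alignment: Sequence[str], blank: str = BLANK) -> List[str]:
    """Standard CTC collapsing: merge consecutive repeats, then drop blanks."""
    tokens: List[str] = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != blank:
            tokens.append(sym)
        prev = sym
    return tokens


def mask_predict(length: int, predict: Predictor, num_iters: int = 10) -> List[str]:
    """Generic mask-predict: start fully masked, fill the masked positions,
    then re-mask the least confident positions on a linearly decaying schedule."""
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    hyp = [MASK] * length
    conf = [0.0] * length
    for it in range(num_iters):
        preds = predict(hyp)
        # Only positions that are currently masked are overwritten.
        for i, (tok, p) in enumerate(preds):
            if hyp[i] == MASK:
                hyp[i], conf[i] = tok, p
        # Number of positions to re-mask before the next iteration.
        n_mask = int(length * (num_iters - 1 - it) / num_iters)
        if n_mask == 0:
            break
        lowest = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        for i in lowest:
            hyp[i] = MASK
    return hyp
```

For example, `ctc_collapse(["a", "a", "<blank>", "a", "b", "b", "<blank>"])` returns `["a", "a", "b"]`: the blank between the second and third `a` keeps them from being merged. The linear re-masking schedule is one common choice for mask-predict and is an assumption of this sketch.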

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection · Dropout