Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

Guolin Zheng; Yubei Xiao; Ke Gong; Pan Zhou; Xiaodan Liang; Liang Lin

arXiv:2109.09161·cs.CL·October 12, 2021

Open Access

TL;DR

Wav-BERT is a novel framework that unifies acoustic and linguistic models to improve low-resource speech recognition by effectively fusing speech and text representations.

Contribution

The paper introduces Wav-BERT, a cooperative learning framework that unifies wav2vec 2.0 and BERT with new modules to enhance low-resource speech recognition.

Findings

01. Significantly outperforms existing methods

02. Achieves state-of-the-art results on low-resource datasets

03. Effectively fuses acoustic and linguistic information

Abstract

Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0)…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Weight Decay · WordPiece · Layer Normalization · Dense Connections · Attention Dropout · Multi-Head Attention · Linear Warmup With Linear Decay