Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition
Guolin Zheng, Yubei Xiao, Ke Gong, Pan Zhou, Xiaodan Liang, Liang Lin

TL;DR
Wav-BERT is a novel framework that unifies acoustic and linguistic models to improve low-resource speech recognition by effectively fusing speech and text representations.
Contribution
The paper introduces Wav-BERT, a cooperative learning framework that unifies wav2vec 2.0 and BERT with new modules to enhance low-resource speech recognition.
Findings
Significantly outperforms existing methods
Achieves state-of-the-art results on low-resource datasets
Effectively fuses acoustic and linguistic information
Abstract
Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Linear Layer · Weight Decay · WordPiece · Layer Normalization · Dense Connections · Attention Dropout · Multi-Head Attention · Linear Warmup With Linear Decay
