Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition
Cheng Yi, Shiyu Zhou, Bo Xu

TL;DR
This paper presents a method to improve low-resource speech recognition by fusing pretrained acoustic and linguistic encoders, achieving better performance with limited labeled data.
Contribution
It introduces a novel fusion of wav2vec2.0 and BERT encoders with a monotonic attention mechanism and scheduled fine-tuning for enhanced low-resource ASR.
Findings
Outperforms existing end-to-end models on the CALLHOME corpus
Effective utilization of pretrained modules improves recognition accuracy
Scheduled fine-tuning preserves linguistic context modeling
Abstract
End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown strong ASR performance, but the available transcriptions are still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters. In addition, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve…
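The abstract does not spell out how the parameter-free monotonic attention matches the two sequence lengths. As a rough illustration only, the sketch below shows one way such a mechanism could work: an integrate-and-fire style accumulation that scans acoustic frames left to right and emits one token-level vector whenever the accumulated weight reaches a threshold. The function names (`integrate_and_fire`, `match_length`), the threshold of 1.0, and the choice to reuse the last encoder channel as the per-frame weight are all assumptions made for this example, not details confirmed by the summary above.

```python
import numpy as np


def integrate_and_fire(frames, weights, threshold=1.0):
    """Collapse T acoustic frames into token-level vectors (illustrative sketch).

    frames:  (T, D) array of acoustic encoder outputs
    weights: (T,) array of non-negative per-frame weights
    Weights are accumulated left to right; each time the running total
    reaches `threshold`, the weighted sum of frames so far is emitted.
    Any leftover weight below the threshold at the end is discarded.
    """
    dim = frames.shape[1]
    outputs = []
    acc_w = 0.0               # accumulated weight for the current token
    acc_v = np.zeros(dim)     # accumulated weighted frames for the current token
    for h, a in zip(frames, weights):
        a = float(a)
        # A single frame may complete one (or more) tokens.
        while acc_w + a >= threshold:
            used = threshold - acc_w      # share of this frame that closes the token
            outputs.append(acc_v + used * h)
            a -= used
            acc_w = 0.0
            acc_v = np.zeros(dim)
        acc_w += a
        acc_v = acc_v + a * h
    return np.stack(outputs) if outputs else np.zeros((0, dim))


def match_length(encoder_out):
    """Assumed parameter-free variant: the last channel supplies the weights."""
    weights = 1.0 / (1.0 + np.exp(-encoder_out[:, -1]))  # sigmoid of last channel
    return integrate_and_fire(encoder_out[:, :-1], weights)


# Toy usage: 6 frames with total weight 3.5 yield 3 token-level vectors.
frames = np.arange(12, dtype=float).reshape(6, 2)
weights = np.array([0.5, 0.5, 0.25, 0.75, 1.0, 0.5])
tokens = integrate_and_fire(frames, weights)
print(tokens.shape)  # (3, 2)
```

Because the process only moves forward through the frames, the alignment is monotonic by construction, and taking the weights from an existing encoder channel adds no trainable parameters. In a full model the resulting token-level vectors would then pass through the fully connected layer mentioned in the abstract to reach the linguistic encoder's hidden size.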
