Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

Cheng Yi; Shiyu Zhou; Bo Xu

arXiv:2101.06699·cs.CL·May 12, 2021

TL;DR

This paper improves low-resource speech recognition by fusing pretrained acoustic and linguistic encoders into one end-to-end model, so that only the transfer from speech to language has to be learned from the limited labeled data.

Contribution

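It fuses a pretrained wav2vec2.0 acoustic encoder and a pretrained BERT linguistic encoder into an end-to-end ASR model, matching the lengths of the two modalities with a monotonic attention mechanism that adds no parameters, mapping between their hidden spaces with a fully connected layer, and applying a scheduled fine-tuning strategy.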

Findings

01 Outperforms existing end-to-end models on the CALLHOME corpus

02 Effective utilization of pretrained modules improves recognition accuracy

03 Scheduled fine-tuning preserves linguistic context modeling

Abstract

End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown strong ASR performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters. In addition, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve…
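
The abstract does not spell out how the monotonic attention matches lengths, so the following is only an illustrative sketch of one parameter-free way a long frame sequence could be collapsed, left to right, into a shorter token-level sequence and then passed through a fully connected mapping. The function names `integrate_and_fire` and `linear_map`, the threshold of 1.0, the assumption that each frame arrives with a weight in [0, 1], and the toy dimensions are all hypothetical and are not taken from the paper.

```python
def integrate_and_fire(frames, weights, threshold=1.0):
    """Monotonically collapse frame vectors into token vectors.

    frames:  list of T vectors (lists of floats), e.g. acoustic encoder outputs.
    weights: list of T weights, assumed to lie in [0, 1] (hypothetical input).
    A token is emitted each time the accumulated weight reaches `threshold`;
    the accumulation itself has no trainable parameters.
    """
    dim = len(frames[0]) if frames else 0
    tokens = []
    acc_weight = 0.0
    acc_vector = [0.0] * dim
    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc_weight += w
            acc_vector = [a + w * x for a, x in zip(acc_vector, frame)]
        else:
            # Use just enough of this frame to reach the threshold, then fire.
            used = threshold - acc_weight
            tokens.append([a + used * x for a, x in zip(acc_vector, frame)])
            # The leftover weight of this frame starts the next token.
            acc_weight = w - used
            acc_vector = [acc_weight * x for x in frame]
    return tokens


def linear_map(tokens, matrix, bias):
    """Fully connected layer: map each token vector into another hidden space.

    matrix has shape (out_dim, in_dim); bias has length out_dim.
    """
    return [
        [sum(m * x for m, x in zip(row, tok)) + b for row, b in zip(matrix, bias)]
        for tok in tokens
    ]


# Toy example: 5 frames of dimension 2 collapse into 2 token-level vectors.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
weights = [0.5, 0.5, 0.25, 0.75, 0.5]
tokens = integrate_and_fire(frames, weights)

# Hypothetical 2 -> 3 mapping standing in for the acoustic-to-linguistic layer.
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
mapped = linear_map(tokens, matrix, bias)
print(len(frames), "frames ->", len(tokens), "tokens of size", len(mapped[0]))
```

The order of frames is never changed and each frame contributes only to the current or the next token, which is what makes the alignment monotonic in this sketch. In a real system the weights and the mapping would come from the trained model rather than from hand-written values.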
