Speech-text based multi-modal training with bidirectional attention for improved speech recognition

Yuhang Yang; Haihua Xu; Hao Huang; Eng Siong Chng; Sheng Li

arXiv:2211.00325 · eess.AS · November 2, 2022

TL;DR

This paper introduces a bidirectional attention mechanism for multi-modal training in speech recognition, enabling better use of unpaired text data and improving model performance.

Contribution

It proposes a novel bidirectional attention mechanism to synchronize speech and text features, enhancing data efficiency and representation quality in end-to-end ASR models.

Findings

01  Up to 6.15% word error rate reduction (WERR) with only paired data
02  Up to 9.23% WERR with additional unpaired text data
03  Improved speech and text representations for ASR
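
As a worked illustration of the metric in the findings, here is a minimal sketch that assumes WERR denotes the relative word error rate reduction against a baseline, which is the usual convention but is not defined on this page. The function name `werr` and the input numbers are hypothetical and are not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction, in percent.

    Assumption: WERR = 100 * (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline WER of 10.0% improved to 9.0% is a 10% relative reduction.
print(f"{werr(10.0, 9.0):.2f}%")  # prints "10.00%"
```

Under this reading, a 6.15% WERR means the new system removes about 6 of every 100 word errors the baseline makes, not that the absolute WER drops by 6.15 points.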

Abstract

For a state-of-the-art end-to-end ASR model to gain data efficiency and to exploit much more unpaired text data through multi-modal training, two problems must be addressed: 1) the synchronicity of feature sampling rates between speech and language (i.e., text data); 2) the homogeneity of the representations learned by the two encoders. In this paper we propose a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and the text encoder with a multi-modal learning method. The BiAM facilitates feature sampling rate exchange, so that the quality of the transformed features of one modality can be measured in the space of the other, with diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder are more similar to corresponding speech…
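
The "feature sampling rate exchange" in the abstract can be illustrated with generic bidirectional cross-attention. The sketch below is not the authors' BiAM: it omits learned projections, multiple heads, and the objective functions, and every name, shape, and dimension in it is an assumption chosen for illustration. It only shows how attending in both directions yields speech features at the text rate and text features at the speech rate.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    queries:     (T_q, d)
    keys_values: (T_kv, d), used as both keys and values
    Returns the attended features (T_q, d) and the weights (T_q, T_kv).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights


def bidirectional_attention(speech, text):
    # Text queries attend over speech frames: speech content at text length.
    speech_at_text_rate, _ = cross_attend(text, speech)
    # Speech queries attend over text tokens: text content at speech length.
    text_at_speech_rate, _ = cross_attend(speech, text)
    return speech_at_text_rate, text_at_speech_rate


# Hypothetical sizes: 50 speech frames, 12 text tokens, 8-dim features.
rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 8))
text = rng.standard_normal((12, 8))

s2t, t2s = bidirectional_attention(speech, text)
print(s2t.shape, t2s.shape)  # prints "(12, 8) (50, 8)"
```

Because each output has the length of the query sequence, the two modalities can be compared frame-by-frame or token-by-token in either space, which is one plausible way to attach losses on both sides.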

Code & Models

Repositories

yuhangear/multi-modal-learning (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques