Multi-Modal Pre-Training for Automated Speech Recognition

David M. Chan; Shalini Ghosh; Debmalya Chakrabarty; Björn Hoffmeister

arXiv:2110.09890·eess.AS·September 19, 2022

Open Access

TL;DR

This paper introduces a multi-modal pre-training approach that incorporates environmental context into speech recognition, improving robustness and accuracy over traditional local-only methods.

Contribution

It presents a novel self-supervised multi-modal encoding and deep-fusion framework that enhances ASR performance by leveraging environmental information.

Findings

1. Up to 7% improvement on Librispeech

2. 6% to 45% gains on internal datasets

3. Enhanced robustness to noise and corruption

Abstract

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing