Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Xichen Pan; Peiyu Chen; Yichen Gong; Helong Zhou; Xinbing Wang; Zhouhan Lin

arXiv:2203.07996 · cs.SD · March 29, 2022

TL;DR

This paper demonstrates how unimodal self-supervised learning can be effectively integrated into multimodal audio-visual speech recognition models, significantly improving performance on benchmark datasets without relying on external language models.

Contribution

It introduces a novel framework that leverages pretrained unimodal models for audio and visual data to enhance multimodal AVSR performance.

Findings

01

Achieved state-of-the-art results on the LRS2 dataset.

02

Improved on the previous state of the art by 30% relative, without using an external language model.

03

Validated effectiveness on both word-level and sentence-level tasks.

Abstract

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is costly, especially for audio-visual speech recognition (AVSR). Thus it makes a lot of sense to make use of unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to improve multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets, then we integrate components of both front-ends into a larger multimodal framework which learns to recognize parallel audio-visual data into characters through a combination of CTC and seq2seq…
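
The abstract is cut off before it explains how the CTC and seq2seq objectives are combined. As a rough illustration only, one common way to combine them is a weighted sum of the CTC negative log-likelihood and the decoder's per-character cross-entropy. The pure-Python sketch below assumes that form; the function names and the `ctc_weight` default are hypothetical and are not taken from the paper or its repository.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of symbol indices (no blanks).
    """
    # Interleave blanks: [b, c1, b, c2, b, ...]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank in between
            new[s] = log_sum_exp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last symbol or the trailing blank.
    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def seq2seq_nll(step_log_probs, target):
    """Teacher-forced cross-entropy: sum of -log p(target char) per decoder step."""
    return -sum(lp[c] for lp, c in zip(step_log_probs, target))


def hybrid_loss(ctc_log_probs, s2s_log_probs, target, ctc_weight=0.2):
    """Weighted sum of the two objectives; ctc_weight=0.2 is an assumed value."""
    return (ctc_weight * ctc_nll(ctc_log_probs, target)
            + (1.0 - ctc_weight) * seq2seq_nll(s2s_log_probs, target))
```

In a PyTorch training loop the same idea would normally be expressed with `torch.nn.CTCLoss` and `torch.nn.CrossEntropyLoss` rather than hand-written recursions; the version above only makes the arithmetic explicit.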

Code & Models

Repositories

lumia-group/leveraging-self-supervised-learning-for-avsr
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing