AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Jiachen Lian; Alexei Baevski; Wei-Ning Hsu; Michael Auli

arXiv:2302.06419 · eess.AS · January 23, 2024 · 1 cites

TL;DR

AV-data2vec introduces a self-supervised, end-to-end audio-visual speech representation learning method using a shared transformer, which improves speech recognition performance by effectively combining audio and video modalities.

Contribution

It presents AV-data2vec, a novel joint audio-visual self-supervised learning approach with contextualized target representations and a shared transformer encoder.

Findings

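01 Outperforms existing methods on the LRS3 dataset under all settings, given the same amount of data and model size

02 Consistently improves speech recognition accuracy

03 Integrates audio and video modalities effectively through a shared transformer encoder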

Abstract

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
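
The abstract does not spell out how contextualized targets are built, so the following is only a minimal sketch. It assumes the recipe of the uni-modal data2vec method that AV-data2vec builds on: a teacher network tracks the student by exponential moving average, targets are the average of the teacher's top layer outputs, and the student regresses those targets at masked time steps. All function names, the squared-error loss, and the list-based toy tensors are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of data2vec-style training targets (not the paper's code).
# Features are plain nested lists: layer -> time step -> feature dimension.

def ema_update(teacher, student, decay):
    """Move each teacher weight toward the matching student weight (EMA)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def contextualized_targets(layer_outputs, top_k):
    """Average the last top_k teacher layer outputs at every time step."""
    chosen = layer_outputs[-top_k:]
    num_steps = len(chosen[0])
    dim = len(chosen[0][0])
    return [
        [sum(layer[t][d] for layer in chosen) / top_k for d in range(dim)]
        for t in range(num_steps)
    ]


def masked_regression_loss(predictions, targets, mask):
    """Mean squared error over masked time steps only (assumed loss)."""
    total, count = 0.0, 0
    for pred, tgt, is_masked in zip(predictions, targets, mask):
        if is_masked:
            total += sum((p - q) ** 2 for p, q in zip(pred, tgt))
            count += len(pred)
    return total / count if count else 0.0


# Toy usage: three teacher layers, two time steps, two feature dimensions.
teacher_layers = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 3.0], [2.0, 2.0]],
    [[3.0, 5.0], [4.0, 6.0]],
]
targets = contextualized_targets(teacher_layers, top_k=2)
student_predictions = [[1.0, 4.0], [3.0, 4.0]]
loss = masked_regression_loss(student_predictions, targets, mask=[True, False])
new_teacher_weights = ema_update([1.0, 2.0], [3.0, 4.0], decay=0.5)
```

In an audio-visual setting the per-step features would come from fused audio and video inputs passed through the shared transformer encoder; the fusion step is omitted here because the abstract does not describe it.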

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis