Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

Jing-Xuan Zhang; Genshun Wan; Zhen-Hua Ling; Jia Pan; Jianqing Gao; Cong Liu

arXiv:2212.02782 · eess.AS · December 7, 2022

Open Access

TL;DR

This paper introduces AV2vec, a self-distillation method for learning audio-visual speech representations that reduces training time relative to AV-HuBERT. A variant, AV2vec-MLM, adds an MLM-style loss and improves downstream task performance.

Contribution

The paper proposes AV2vec, a self-distillation approach that removes the iterative training step required by AV-HuBERT, and AV2vec-MLM, which adds an MLM-style loss through multitask learning to improve performance.

Findings

01

AV2vec reduces total training time to less than one-fifth of that of AV-HuBERT.

02

AV2vec achieves performance comparable to the AV-HuBERT baseline.

03

AV2vec-MLM, which combines AV2vec with an MLM-style loss, outperforms the baselines on downstream tasks.

Abstract

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature regression task using the multimodal target features generated online by the teacher. The parameters of the teacher model are a momentum update of the student. Because the target features are generated online, AV2vec, unlike AV-HuBERT, needs no iteration step, and the total training time cost is reduced to less than one-fifth. We further propose AV2vec-MLM in this study, which augments AV2vec with a masked language model (MLM)-style loss using multitask learning. Our experimental results show that AV2vec achieved comparable performance to the AV-HuBERT baseline. When combined with an MLM-style loss, AV2vec-MLM outperformed baselines and achieved the best…
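
The two mechanisms named in the abstract, a teacher whose parameters are a momentum update of the student and a regression loss computed on masked frames, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ema_update` and `masked_regression_loss`, the scalar-per-name parameter layout, the decay value, and the choice of mean squared error as the regression loss are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List


def ema_update(
    teacher: Dict[str, float], student: Dict[str, float], decay: float
) -> Dict[str, float]:
    """Return teacher parameters after one momentum (EMA) update toward the student.

    Assumption: each parameter is a single float keyed by name; a real model
    would apply the same rule elementwise to every tensor.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def masked_regression_loss(
    predictions: List[List[float]],
    targets: List[List[float]],
    mask: List[bool],
) -> float:
    """Mean squared error between student predictions and teacher targets,
    averaged over masked frames only (MSE is an assumed choice of loss)."""
    total, count = 0.0, 0
    for pred, targ, is_masked in zip(predictions, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        for p, t in zip(pred, targ):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical toy values: one parameter, two frames of two features each.
new_teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
loss = masked_regression_loss(
    predictions=[[1.0, 2.0], [0.0, 0.0]],
    targets=[[0.0, 0.0], [5.0, 5.0]],
    mask=[True, False],
)
```

Because the teacher is updated from the student at each step, the targets are produced during training itself, which is what lets this style of method skip a separate target-regeneration pass.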

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing