Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Jing-Xuan Zhang; Genshun Wan; Jianqing Gao; Zhen-Hua Ling

arXiv:2502.05766·eess.AS·February 11, 2025·2 cites

Open Access · 1 Repo

TL;DR

This paper introduces an audio-visual representation learning approach based on cross-modal knowledge distillation from speech foundation models, achieving results superior or comparable to state-of-the-art baselines across multiple speech recognition tasks.

Contribution

The paper proposes a new method leveraging multi-layer representations from speech foundation models as teachers for cross-modal knowledge distillation in audio-visual learning.

Findings

1. Achieved superior or comparable results to state-of-the-art baselines.
2. Demonstrated effectiveness across multiple speech recognition tasks.
3. Validated the approach through extensive ablation studies and visualization.

Abstract

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized…
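
The distillation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`average_layers`, `ensemble_target`, `distillation_loss`), the uniform averaging over layers and teachers, and the loss itself (mean squared error plus cosine distance) are all assumptions made for the example. The excerpt does not specify how the paper combines layers or teachers, nor the form of its representational distillation loss.

```python
import math


def average_layers(layers):
    """Element-wise mean of several feature maps, each shaped [frames][dims]."""
    n = len(layers)
    frames = len(layers[0])
    dims = len(layers[0][0])
    return [
        [sum(layer[t][d] for layer in layers) / n for d in range(dims)]
        for t in range(frames)
    ]


def ensemble_target(teachers):
    """Build one distillation target from several teachers.

    `teachers` is a list with one entry per teacher; each entry is a list of
    hidden-layer outputs computed from clean audio. Uniform averaging over
    layers and then over teachers is an assumption of this sketch.
    """
    per_teacher = [average_layers(layers) for layers in teachers]
    return average_layers(per_teacher)


def cosine_distance(u, v):
    """Return 1 - cosine similarity, or 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(student, target, cos_weight=1.0):
    """Frame-averaged MSE plus weighted cosine distance.

    This is a stand-in loss chosen for illustration, not the paper's loss.
    `student` holds the audio-visual student's output, shaped [frames][dims].
    """
    total = 0.0
    for s, t in zip(student, target):
        mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        total += mse + cos_weight * cosine_distance(s, t)
    return total / len(student)
```

In a training loop under these assumptions, `ensemble_target` would be computed once per utterance from the frozen teachers, and `distillation_loss` would compare it with the student's output on the corresponding audio-visual input.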

Code & Models

Repositories

jxzhanggg/DistillAV
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies

Methods: Knowledge Distillation