Audio Representation Learning by Distilling Video as Privileged Information

Amirhossein Hajavi; Ali Etemad

arXiv:2302.02845·cs.SD·February 7, 2023

TL;DR

This paper introduces an audio representation learning method that treats video as privileged information: knowledge from the video modality is distilled into an audio model during training, improving performance when video is unavailable at inference.

Contribution

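The work proposes a teacher-student knowledge distillation approach in which the student is trained on embeddings learned by the teacher, rather than on teacher-generated soft labels, so that inference needs audio only. The method is integrated in both sequential and non-sequential data settings.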

Findings

01 Significant improvements in speaker recognition accuracy.

02 Enhanced speech emotion recognition performance.

03 Outperforms prior LUPI-based methods.

Abstract

Deep audio representation learning using multi-modal audio-visual data often leads to better performance than uni-modal approaches. However, in real-world scenarios the two modalities are not always both available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous LUPI methods use soft labels generated by the teacher, our proposed method uses embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data where the features are divided into…
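The contrast the abstract draws between soft-label distillation and embedding distillation can be illustrated with a small sketch. This is not the authors' implementation: the listing does not state which loss the paper uses, so the mean squared error between embeddings below is an assumption made for illustration, and the function names (`soft_label_loss`, `embedding_loss`) and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL(teacher || student) on softened outputs.

    This is the style of objective the abstract attributes to earlier
    LUPI methods. The temperature of 2.0 is an illustrative choice.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    return sum(
        pt * (math.log(pt + eps) - math.log(ps + eps))
        for pt, ps in zip(p_teacher, p_student)
    )


def embedding_loss(student_embedding, teacher_embedding):
    """Embedding distillation: pull the student's embedding toward the teacher's.

    ASSUMPTION: mean squared error is used here only as a stand-in; the
    paper's actual distance between embeddings is not given in this listing.
    """
    if len(student_embedding) != len(teacher_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [(s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)]
    return sum(squared) / len(squared)


# Toy values: the teacher saw audio and video, the student saw audio only.
teacher_embedding = [0.5, -1.0, 2.0]
student_embedding = [0.4, -0.8, 1.5]
print("embedding loss:", embedding_loss(student_embedding, teacher_embedding))

teacher_logits = [1.0, 2.0, 3.0]
student_logits = [1.5, 1.5, 2.5]
print("soft-label loss:", soft_label_loss(student_logits, teacher_logits))
```

In this sketch both losses are zero when the student matches the teacher exactly. The difference lies in what is matched: the soft-label loss compares class distributions, while the embedding loss compares intermediate representations, which is the kind of signal the abstract says the proposed method transfers.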

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation

Methods: Test · Knowledge Distillation