Audio Representation Learning by Distilling Video as Privileged Information
Amirhossein Hajavi, Ali Etemad

TL;DR
This paper introduces an audio representation learning method that treats video as privileged information: knowledge learned from audio-visual data is distilled into an audio model, improving performance when video is unavailable at inference.
Contribution
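The work proposes a teacher-student knowledge distillation approach for audio-only inference in which the student is trained on embeddings learned by the teacher rather than on teacher-generated soft labels. The approach applies to both sequential and non-sequential data settings.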
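The difference between soft-label distillation and embedding-based distillation can be illustrated with a small sketch. This is not the authors' implementation: the mean-squared-error distance, the weighting factor `alpha`, the temperature, and all function names are assumptions chosen for illustration, since this summary does not specify the loss used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard classification loss for one example."""
    return -math.log(softmax(logits)[label])


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL(teacher || student) on softened outputs.

    This is the style of objective the summary attributes to prior LUPI methods.
    """
    p = softmax([z / temperature for z in teacher_logits])
    q = softmax([z / temperature for z in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def embedding_loss(student_emb, teacher_emb):
    """Embedding distillation: distance between embeddings.

    MSE is an assumed choice of distance, not taken from the paper.
    """
    n = len(student_emb)
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / n


def student_objective(student_emb, student_logits, teacher_emb, label, alpha=0.5):
    """Hypothetical combined objective for the audio-only student.

    The teacher embedding would come from a network that saw audio and video;
    the student sees audio only. alpha is an assumed trade-off weight.
    """
    distill = embedding_loss(student_emb, teacher_emb)
    task = cross_entropy(student_logits, label)
    return alpha * distill + (1.0 - alpha) * task


# Toy values standing in for network outputs on one training example.
teacher_emb = [0.2, -0.1, 0.7]   # from the audio-visual teacher
student_emb = [0.0, 0.0, 0.5]    # from the audio-only student
student_logits = [1.5, 0.3, -0.8]
loss = student_objective(student_emb, student_logits, teacher_emb, label=0)
```

At inference only the student is run, so no video input or teacher forward pass is needed; the teacher's embeddings matter only during training.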
Findings
Significant improvements in speaker recognition accuracy.
Enhanced speech emotion recognition performance.
Outperforms prior LUPI-based methods.
Abstract
Deep audio representation learning using multi-modal audio-visual data often leads to better performance than uni-modal approaches. However, in real-world scenarios both modalities are not always available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous LUPI methods use soft labels generated by the teacher, our proposed method uses embeddings learned by the teacher to train the student network. We integrate our method in two different settings: sequential data where the features are divided into…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation
Methods: Test · Knowledge Distillation
