Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning

Donghuo Zeng; Kazushi Ikeda

arXiv:2501.09608·cs.SD·January 17, 2025

Open Access

TL;DR

This paper introduces a novel audio-visual embedding learning method that combines cross-modal triplet loss with progressive self-distillation to better utilize inherent data distributions and improve alignment beyond label guidance.

Contribution

It proposes a new architecture that integrates progressive self-distillation with triplet loss for enhanced audio-visual embedding learning, capturing complex relationships beyond labels.

Findings

1. Improved embedding quality demonstrated on benchmark datasets.
2. Enhanced alignment accuracy between audio and visual modalities.
3. Outperforms existing label-guided methods in various metrics.

Abstract

Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments…
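To make the two ingredients named in the abstract concrete, here is a minimal sketch of a cross-modal triplet loss and of a soft audio-visual alignment. It is an illustration under stated assumptions, not the authors' implementation: the cosine distance, the margin of 0.2, the softmax temperature, and the `blend_targets` helper with its mixing weight `alpha` are all hypothetical choices that the abstract does not specify.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cross_modal_triplet_loss(audio_anchor, visual_pos, visual_neg, margin=0.2):
    """Hinge-style triplet loss with an audio anchor and visual samples.

    The margin value is an assumption chosen for illustration.
    """
    d_pos = cosine_distance(audio_anchor, visual_pos)
    d_neg = cosine_distance(audio_anchor, visual_neg)
    return max(0.0, d_pos - d_neg + margin)

def soft_alignment(audio_emb, visual_embs, temperature=0.1):
    """Softmax over audio-to-visual similarities: a probabilistic alignment.

    The temperature is an assumed hyperparameter.
    """
    sims = [1.0 - cosine_distance(audio_emb, v) for v in visual_embs]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def blend_targets(hard, soft, alpha):
    """Hypothetical helper: mix label-based targets with soft alignments.

    alpha = 0 keeps the labels only; raising alpha over training shifts
    weight toward the model's own soft alignments. This schedule is an
    assumption about how a "progressive" scheme could look.
    """
    return [(1.0 - alpha) * h + alpha * s for h, s in zip(hard, soft)]

if __name__ == "__main__":
    audio = [1.0, 0.0]
    matching_video = [0.9, 0.1]
    other_video = [0.0, 1.0]
    print(cross_modal_triplet_loss(audio, matching_video, other_video))
    soft = soft_alignment(audio, [matching_video, other_video])
    print(blend_targets([1.0, 0.0], soft, alpha=0.3))
```

The triplet term only uses the label-derived choice of positive and negative, while the soft alignment assigns a probability to every audio-visual pair in the batch, which is one way to read the abstract's point about relationships that are not directly tied to labels.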

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: Triplet Loss