Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs
Donghuo Zeng, Hao Niu, Yanan Wang, Masato Taya

TL;DR
This paper introduces a novel framework for learning audio-visual embeddings that leverages soft-label predictions and inferred dependency graphs to improve semantic alignment and robustness against incidental co-occurrences.
Contribution
It proposes a new method combining soft-label alignment, latent interaction graphs, and a regularizer to better capture true semantic relationships in audio-visual data.
Findings
Improved mean average precision on AVE and VEGAS benchmarks.
Effective identification of directional class dependencies.
Enhanced robustness to background noise and unannotated events.
Abstract
Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences - background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals; because "motorcycle" is not the chosen annotation, standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across…
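The false-negative problem described in the abstract can be illustrated with a small sketch. This is not the paper's AV-SAL, whose definition is cut off above; it only contrasts hard single-label targets with soft-label distributions. The class list, the logits, and the helper names (`softmax`, `overlap`, `symmetric_kl`) are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical three-class vocabulary (an assumption, not from the paper).
CLASSES = ["train", "motorcycle", "dog"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label):
    """Hard target: all mass on the single annotated class."""
    return [1.0 if c == label else 0.0 for c in CLASSES]

def overlap(p, q):
    """Dot product of two label distributions, used here as a crude
    'how semantically related are these two clips' score."""
    return sum(pi * qi for pi, qi in zip(p, q))

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """One generic way to penalize disagreement between two soft-label
    distributions (e.g. audio vs. visual predictions for the same clip)."""
    return 0.5 * (kl(p, q) + kl(q, p))

# A clip annotated "train" that also contains motorcycle content (made-up logits).
clip_soft = softmax([2.0, 1.5, -1.0])
# A true "motorcycle" anchor from another video (made-up logits).
anchor_soft = softmax([-1.0, 3.0, -1.0])

# Hard labels: the two clips share no class, so the pair is a pure negative.
hard_overlap = overlap(one_hot("train"), one_hot("motorcycle"))
# Soft labels: the shared motorcycle mass gives the pair a non-zero relation.
soft_overlap = overlap(clip_soft, anchor_soft)

# Hypothetical per-modality predictions for the same clip.
audio_soft = softmax([1.0, 2.0, -1.0])
visual_soft = softmax([2.0, 1.0, -1.0])
alignment_penalty = symmetric_kl(audio_soft, visual_soft)
```

With hard labels the "train" clip and the motorcycle anchor have zero overlap, so a triplet or contrastive loss would push them apart; with soft labels the overlap is clearly positive, which is the kind of signal a soft-label approach can use to avoid treating the pair as a negative. The symmetric KL term is only one plausible way to encourage the audio and visual distributions of a clip to agree.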
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Music and Audio Processing
