Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs

Donghuo Zeng; Hao Niu; Yanan Wang; Masato Taya

arXiv:2601.11995 · cs.MM · January 21, 2026

Open Access

TL;DR

This paper introduces a novel framework for learning audio-visual embeddings that leverages soft-label predictions and inferred dependency graphs to improve semantic alignment and robustness against incidental co-occurrences.

Contribution

It proposes a new method combining soft-label alignment, latent interaction graphs, and a regularizer to better capture true semantic relationships in audio-visual data.

Findings

01. Improved mean average precision on AVE and VEGAS benchmarks.

02. Effective identification of directional class dependencies.

03. Enhanced robustness to background noise and unannotated events.

Abstract

Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences - background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods rely on sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visuals; because "motorcycle" is not the chosen annotation, standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across…
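The false-negative problem described in the abstract can be illustrated with a toy sketch. This is not the paper's AV-SAL or its graph inference; it only shows how one label per clip can mark a pair as negative even when soft-label distributions overlap. The clip data, the `soft_labels` values, the overlap measure, and the `threshold` are all assumptions invented for illustration.

```python
import numpy as np

# Hypothetical toy setup: 3 clips, 2 classes (index 0 = "train", 1 = "motorcycle").
# One sparse annotation per clip, as in the abstract's example.
hard_labels = np.array([0, 1, 0])

# Assumed soft-label distributions (e.g. from some teacher model); values are invented.
soft_labels = np.array([
    [0.55, 0.45],  # clip 0: annotated "train", but a motorcycle is also present
    [0.05, 0.95],  # clip 1: annotated "motorcycle"
    [0.98, 0.02],  # clip 2: annotated "train", no motorcycle
])

def hard_negative_mask(labels):
    """True where pair (i, j) is treated as a negative under one label per clip."""
    return labels[:, None] != labels[None, :]

def soft_overlap(p):
    """Pairwise overlap of distributions: sum of elementwise minima, in [0, 1]."""
    return np.minimum(p[:, None, :], p[None, :, :]).sum(axis=-1)

neg = hard_negative_mask(hard_labels)
overlap = soft_overlap(soft_labels)

# Pairs that hard labels call negatives although their soft labels overlap
# substantially: candidate false negatives. The threshold is arbitrary.
threshold = 0.4
false_neg_candidates = neg & (overlap >= threshold)

for i, j in zip(*np.nonzero(false_neg_candidates)):
    print(f"clip {i} vs clip {j}: hard-label negative, soft overlap {overlap[i, j]:.2f}")
```

In this toy data, clip 0 ("train" with a motorcycle in it) and clip 1 ("motorcycle") are flagged, while clip 1 and clip 2 remain ordinary negatives.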

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Music and Audio Processing