LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

Langyu Wang; Bingke Zhu; Yingying Chen; Jinqiao Wang

arXiv:2412.20872 · cs.CV · January 3, 2025

Open Access

TL;DR

This paper introduces LINK, a novel adaptive modality interaction method for audio-visual video parsing that dynamically balances modal contributions and uses semantic pseudo-labels to reduce noise, improving performance on the LLP dataset.

Contribution

We propose LINK, an adaptive interaction framework that addresses modality misalignment and noise in audio-visual parsing, enhancing accuracy over existing methods.

Findings

1. Outperforms existing methods on the LLP dataset
2. Effectively balances contributions of audio and visual modalities
3. Reduces noise using semantic pseudo-labels

Abstract

Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.
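
To make the task in the abstract's first sentence concrete, here is a minimal sketch of the parsing output only. It is not the LINK model and makes no claim about its architecture. It assumes a model has already produced per-segment probabilities for one event class in each modality. The function names, the 0.5 threshold, and the example probabilities are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (start, end) in segment indices, end exclusive. Segment length is an assumption
# left to the caller (for example, one second per segment).
Segment = Tuple[int, int]


def active_runs(probs: List[float], threshold: float = 0.5) -> List[Segment]:
    """Merge consecutive segments at or above the threshold into (start, end) runs."""
    runs: List[Segment] = []
    start: Optional[int] = None
    for t, p in enumerate(probs):
        if p >= threshold and start is None:
            start = t                      # a new event run begins here
        elif p < threshold and start is not None:
            runs.append((start, t))        # the run ended at the previous segment
            start = None
    if start is not None:
        runs.append((start, len(probs)))   # run extends to the end of the video
    return runs


def parse_event(
    audio_probs: List[float],
    visual_probs: List[float],
    threshold: float = 0.5,
) -> Dict[str, List[Segment]]:
    """Label one event class as audible, visible, or both, with temporal boundaries."""
    # A segment counts as audio-visual only if both modalities detect the event,
    # which is the same as thresholding the smaller of the two probabilities.
    both = [min(a, v) for a, v in zip(audio_probs, visual_probs)]
    return {
        "audible": active_runs(audio_probs, threshold),
        "visible": active_runs(visual_probs, threshold),
        "audio_visual": active_runs(both, threshold),
    }


# Hypothetical probabilities for a single event class over five segments.
result = parse_event(
    audio_probs=[0.9, 0.8, 0.2, 0.1, 0.7],
    visual_probs=[0.1, 0.9, 0.9, 0.2, 0.8],
)
```

In this example the audible and visible runs only partly overlap, which is the kind of misalignment between modalities that the abstract describes.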

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Subtitles and Audiovisual Media