Label-anticipated Event Disentanglement for Audio-Visual Video Parsing

Jinxing Zhou; Dan Guo; Yuxin Mao; Yiran Zhong; Xiaojun Chang; Meng Wang

arXiv:2407.08126·cs.AI·July 12, 2024·1 cites

TL;DR

This paper introduces LEAP, a new decoding paradigm for audio-visual video parsing that uses label semantics and cross-modal interactions to improve event disentanglement and interpretability, and it reports state-of-the-art performance.

Contribution

It proposes a semantic-based projection decoding method with a novel similarity loss, advancing the decoding phase for better event separation and interpretability in AVVP.

Findings

01. Achieves state-of-the-art AVVP performance

02. Enhances event disentanglement and interpretability

03. Improves audio-visual event localization accuracy

Abstract

The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within audio and visual modalities. Multiple events can overlap in the timeline, making identification challenging. While traditional methods usually focus on improving the early audio-visual encoders to embed more effective features, the decoding phase, which is crucial for final event classification, often receives less attention. We aim to advance the decoding phase and improve its interpretability. Specifically, we introduce a new decoding paradigm, label semantic-based projection (LEAP), that employs the label texts of event categories, each bearing distinct and explicit semantics, for parsing potentially overlapping events. LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. This…
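The iterative projection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated here, so the scaled dot-product attention, the residual update, the function name `project_onto_labels`, and the toy vectors are all assumptions chosen only to show the general idea of label embeddings repeatedly attending over segment features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_labels(segments, labels, num_iters=2):
    """Hypothetical sketch of iterative label-based projection.

    segments: T segment features, each of dimension d (audio or visual).
    labels:   C label embeddings, each of dimension d.
    Returns the updated label embeddings and the last C x T attention map,
    where row c shows which segments label c attends to.
    """
    d = len(labels[0])
    scale = 1.0 / math.sqrt(d)  # assumed scaled dot-product attention
    labels = [list(label) for label in labels]
    attn = []
    for _ in range(num_iters):
        attn = []
        updated = []
        for label in labels:
            # Each label scores every segment, then pools the segment features.
            weights = softmax([scale * dot(label, seg) for seg in segments])
            pooled = [
                sum(w * seg[k] for w, seg in zip(weights, segments))
                for k in range(d)
            ]
            attn.append(weights)
            # Assumed residual update of the label embedding.
            updated.append([a + b for a, b in zip(label, pooled)])
        labels = updated
    return labels, attn


# Toy data: two orthogonal label embeddings and three segments.
# Segment 0 matches label 0, segment 1 matches label 1,
# and segment 2 contains both events (an overlap).
toy_labels = [[1.0, 0.0], [0.0, 1.0]]
toy_segments = [[4.0, 0.0], [0.0, 4.0], [3.0, 3.0]]
labels_out, attn = project_onto_labels(toy_segments, toy_labels)
```

In this toy setup each label attends mostly to its own segment and partly to the overlapping one, which is the kind of per-class temporal evidence a projection-style decoder could expose for inspection.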

Taxonomy

Topics: Music and Audio Processing · Anomaly Detection Techniques and Applications · Speech Recognition and Synthesis
