Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
Jinxing Zhou, Dan Guo, Yuxin Mao, Yiran Zhong, Xiaojun Chang, Meng Wang

TL;DR
This paper introduces LEAP, a new decoding paradigm for audio-visual video parsing that uses label semantics and cross-modal interactions to improve event disentanglement and interpretability, achieving state-of-the-art performance.
Contribution
It proposes a semantic-based projection decoding method with a novel similarity loss, advancing the decoding phase for better event separation and interpretability in AVVP.
Findings
Achieves state-of-the-art AVVP performance
Enhances event disentanglement and interpretability
Improves audio-visual event localization accuracy
Abstract
The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within audio and visual modalities. Multiple events can overlap in the timeline, making identification challenging. While traditional methods usually focus on improving the early audio-visual encoders to embed more effective features, the decoding phase, which is crucial for final event classification, often receives less attention. We aim to advance the decoding phase and improve its interpretability. Specifically, we introduce a new decoding paradigm, label semantic-based projection (LEAP), that employs the label texts of event categories, each bearing distinct and explicit semantics, for parsing potentially overlapping events. LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. This…
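The abstract only says that LEAP iteratively projects segment features onto label embeddings; it gives no details here. The sketch below is a toy illustration of that general idea, not the authors' implementation. The attention-style pooling, the residual update, the sigmoid scoring, and every name in it (`project_onto_labels`, `num_iters`, the example vectors) are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project_onto_labels(segments, labels, num_iters=2):
    """Hypothetical iterative projection of segment features onto label embeddings.

    segments: T feature vectors, one per audio or visual segment.
    labels:   C label embeddings, one per event category.
    Returns the updated label embeddings and a T x C table of scores in (0, 1).
    """
    dim = len(labels[0])
    scale = math.sqrt(dim)
    labels = [list(label) for label in labels]  # copy so the inputs stay untouched

    for _ in range(num_iters):
        updated = []
        for label in labels:
            # Assumption: each label attends over all segments of one modality.
            weights = softmax([dot(label, seg) / scale for seg in segments])
            pooled = [
                sum(w * seg[d] for w, seg in zip(weights, segments))
                for d in range(dim)
            ]
            # Assumption: residual update of the label embedding.
            updated.append([l + p for l, p in zip(label, pooled)])
        labels = updated

    # Each class is scored independently, so one segment can fire for several events.
    scores = [
        [1.0 / (1.0 + math.exp(-dot(seg, label) / scale)) for label in labels]
        for seg in segments
    ]
    return labels, scores


# Toy input: segment 0 matches class 0, segment 1 matches class 1,
# and segment 2 contains both (an overlapping event).
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]
updated_labels, scores = project_onto_labels(segments, label_embeddings)
```

Because each class gets its own score, the overlapping segment in this toy input receives equally high scores for both categories. That is the behaviour one would want from a decoder aimed at disentangling overlapping events.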
Taxonomy
Topics: Music and Audio Processing · Anomaly Detection Techniques and Applications · Speech Recognition and Synthesis
