Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen

TL;DR
This paper introduces a multimodal network for audio-visual video parsing that decouples semantically mixed segment features into class-wise features and enhances their event semantics. By reducing semantic interference during intra- and cross-modal interactions, it achieves state-of-the-art parsing results.
Contribution
It proposes a Class-Aware Feature Decoupling (CAFD) module and a fine-grained semantic enhancement framework to improve event recognition and temporal localization in audio-visual videos.
Findings
Achieves new state-of-the-art parsing performance.
Effectively reduces semantic interference during intra- and cross-modal interactions.
Demonstrates significant improvements over prior methods.
Abstract
The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background…
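The decoupling idea described above can be illustrated with a minimal sketch. This is not the authors' CAFD implementation: it assumes, purely for illustration, that each class slot (every event class plus one background slot) has its own linear projection applied to the holistic segment feature. The function names, the random initialization, and the dimensions are all hypothetical.

```python
import random


def make_projection(dim_in, dim_out, rng):
    """Build a small random dim_out x dim_in matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def project(matrix, vector):
    """Matrix-vector product written out in plain Python."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def decouple_segment(feature, projections):
    """Map one holistic segment feature to one feature per class slot."""
    return [project(p, feature) for p in projections]


def decouple_video(segments, num_event_classes, dim_out, seed=0):
    """Return class-wise features with shape [T][num_event_classes + 1][dim_out].

    Assumption: one projection per event class, plus one extra slot that
    plays the role of the dedicated background feature.
    """
    rng = random.Random(seed)
    dim_in = len(segments[0])
    projections = [
        make_projection(dim_in, dim_out, rng)
        for _ in range(num_event_classes + 1)
    ]
    return [decouple_segment(segment, projections) for segment in segments]


# Toy input: 10 segments, each with an 8-dimensional holistic feature.
segments = [[float(t + d) for d in range(8)] for t in range(10)]
class_wise = decouple_video(segments, num_event_classes=3, dim_out=4)
print(len(class_wise), len(class_wise[0]), len(class_wise[0][0]))
```

With class-wise features of this shape, later temporal interactions could in principle be restricted to slots of the same class, so that a segment's representation of one event does not absorb semantics of unrelated events from other segments. Whether the paper does this in exactly this way is not stated in the abstract excerpt.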
Taxonomy
Topics: Video Analysis and Summarization · Speech and Audio Processing · Digital Media Forensic Detection
