Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

Pengcheng Zhao; Jinxing Zhou; Yang Zhao; Dan Guo; Yanxiang Chen

arXiv:2412.11248·cs.CV·December 18, 2024

TL;DR

This paper introduces a multimodal network for audio-visual video parsing that decouples the semantically mixed features of each segment into class-wise features and enhances their event semantics. By reducing semantic interference during intra- and cross-modal interactions, it achieves state-of-the-art results.

Contribution

The paper proposes a Class-Aware Feature Decoupling (CAFD) module and a fine-grained semantic enhancement framework to improve event recognition and temporal localization in audio-visual videos.

Findings

01 Achieves new state-of-the-art parsing performance.

02 Effectively reduces semantic interference during intra- and cross-modal interactions.

03 Demonstrates significant improvements over prior methods.

Abstract

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background…
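The task definition in the first sentence of the abstract can be made concrete with a small sketch of what a parsing result looks like. This is not the paper's method. It only assumes that some model has already produced, for each modality and each event class, one boolean prediction per temporal segment. The class names, the function names `spans_from_mask` and `parse_video`, and the output layout are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_segment, end_segment_exclusive)
Preds = Dict[str, List[bool]]  # class name -> per-segment prediction (assumed input format)


def spans_from_mask(mask: List[bool]) -> List[Span]:
    """Merge runs of consecutive active segments into (start, end) spans."""
    spans: List[Span] = []
    start = None
    for t, active in enumerate(mask):
        if active and start is None:
            start = t  # a new event run begins here
        elif not active and start is not None:
            spans.append((start, t))  # the run ended at the previous segment
            start = None
    if start is not None:
        spans.append((start, len(mask)))  # run extends to the last segment
    return spans


def parse_video(audio: Preds, visual: Preds) -> Dict[str, Dict[str, List[Span]]]:
    """Localize events per modality, plus events active in both at once."""
    result: Dict[str, Dict[str, List[Span]]] = {
        "audio": {}, "visual": {}, "audio_visual": {},
    }
    for modality, preds in (("audio", audio), ("visual", visual)):
        for cls, mask in preds.items():
            spans = spans_from_mask(mask)
            if spans:
                result[modality][cls] = spans
    # An audio-visual event is a segment where the class is active in both streams.
    for cls in audio.keys() & visual.keys():
        both = [a and v for a, v in zip(audio[cls], visual[cls])]
        spans = spans_from_mask(both)
        if spans:
            result["audio_visual"][cls] = spans
    return result


# Hypothetical predictions for a five-segment video.
audio_pred = {
    "speech": [True, True, False, False, True],
    "dog": [False, False, False, False, False],
}
visual_pred = {
    "speech": [False, True, True, False, True],
    "dog": [False, False, True, True, False],
}
parsed = parse_video(audio_pred, visual_pred)
```

In this toy input, "dog" is visible but never audible, so it appears only under `visual`, while "speech" yields audio-only, visual-only, and audio-visual spans. A segment can carry several events at once, which is the situation the abstract identifies as the source of semantically mixed features.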

Taxonomy

Topics: Video Analysis and Summarization · Speech and Audio Processing · Digital Media Forensic Detection