Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing
Jie Fu, Junyu Gao, Changsheng Xu

TL;DR
This paper introduces a dynamic gradient modulation mechanism with a modality-separated decision unit to address imbalanced feature learning in weakly-supervised audio-visual video parsing, improving localization and classification accuracy.
Contribution
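It proposes a dynamic gradient modulation (DGM) mechanism and a modality-separated decision unit (MSDU) to balance feature learning between the audio and visual modalities in weakly-supervised audio-visual video parsing (WS-AVVP).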
Findings
Improved localization accuracy on benchmark datasets.
Enhanced modality balance in feature learning.
Demonstrated effectiveness over existing methods.
Abstract
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances as well as identify the corresponding event categories with only video-level category labels for training. Most previous methods pay much attention to refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning. None of them have noticed the imbalanced feature learning between different modalities in the task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, where a novel and effective metric function is designed to measure the imbalanced feature learning between audio and visual modalities. Furthermore, principle analysis indicates that the multimodal confusing calculation will hamper the…
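The abstract does not spell out the metric function or the modulation rule, so the following is only a rough illustration of the general idea of imbalance-aware gradient modulation. It assumes each modality yields a scalar confidence score for the ground-truth classes, measures imbalance as the ratio of those scores, and attenuates the gradients of the dominant modality with a tanh-based coefficient. The function names, the ratio metric, the `alpha` parameter, and the tanh form are all assumptions for illustration, not the formulation used in the paper.

```python
import math

def modulation_coefficients(audio_score, visual_score, alpha=1.0, eps=1e-8):
    """Return (k_audio, k_visual) gradient scaling factors in [0, 1].

    Hypothetical rule: the modality with the higher score is treated as
    dominant and has its gradients scaled down; the other is left unchanged.
    """
    ratio = (audio_score + eps) / (visual_score + eps)
    if ratio > 1.0:
        # Audio dominates: shrink audio gradients, keep visual gradients.
        return 1.0 - math.tanh(alpha * (ratio - 1.0)), 1.0
    # Visual dominates (or the two are balanced): shrink visual gradients.
    return 1.0, 1.0 - math.tanh(alpha * (1.0 / ratio - 1.0))

def scale_gradients(grads, k):
    """Multiply every gradient value of one modality's encoder by k."""
    return [k * g for g in grads]

# Toy example: audio is twice as confident as visual on the true classes.
k_audio, k_visual = modulation_coefficients(audio_score=0.8, visual_score=0.4)
audio_grads = scale_gradients([0.5, -0.2], k_audio)    # attenuated
visual_grads = scale_gradients([0.1, 0.3], k_visual)   # unchanged
```

In a real training loop the scaling would be applied to the encoder gradients of each modality after the backward pass and before the optimizer step; how the scores are computed per modality is left open here.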
Taxonomy
Topics: Cancer-related molecular mechanisms research · Subtitles and Audiovisual Media · Digital Media Forensic Detection
Methods: None
