MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Langyu Wang, Bingke Zhu, Yingying Chen, Yiyuan Zhang, Ming Tang, Jinqiao Wang

TL;DR
This paper introduces MUG, a novel audio-visual network with pseudo-labeling and augmentation techniques that improve weakly-supervised video parsing by enhancing segment and event-level predictions, achieving state-of-the-art results.
Contribution
The paper proposes a new Mamba network with pseudo-labeling and cross-modal augmentation to better handle segment-level and event-level predictions in audio-visual video parsing.
Findings
Improves on the state of the art on the LLP dataset across all metrics.
Enhances segment-level and event-level prediction accuracy.
Demonstrates effectiveness of pseudo-labeling and augmentation techniques.
Abstract
Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, existing methods struggle to improve segment-level and event-level prediction simultaneously, owing to the limitations of weak supervision and deficiencies in model architecture. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we…
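The cross-modal random combination mentioned in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not spell out the procedure, so this is only one plausible reading: the audio track of one video, with its unimodal pseudo-labels, is paired with the visual track of a different video. The names `Clip`, `Sample`, and `cross_modal_mix` are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Clip:
    """One modality of a video (hypothetical container, not from the paper)."""
    features: List[List[float]]  # one feature vector per segment
    labels: List[Set[str]]       # segment-level pseudo-labels (event names)


@dataclass
class Sample:
    audio: Clip
    visual: Clip

    @property
    def video_labels(self) -> Set[str]:
        # Video-level weak label: union of events found in either modality.
        events: Set[str] = set()
        for segment_events in self.audio.labels + self.visual.labels:
            events |= segment_events
        return events


def cross_modal_mix(samples: List[Sample], rng: random.Random) -> Sample:
    """Assumed augmentation: audio from one video, visual from another.

    Each modality keeps its own segment-level pseudo-labels, so the new
    sample carries an event combination that may not exist in the data.
    """
    audio_source, visual_source = rng.sample(samples, 2)  # two distinct videos
    return Sample(audio=audio_source.audio, visual=visual_source.visual)


# Toy data: two videos, two segments each, one-dimensional features.
s1 = Sample(
    audio=Clip([[0.1], [0.2]], [{"Speech"}, set()]),
    visual=Clip([[0.3], [0.4]], [{"Car"}, {"Car"}]),
)
s2 = Sample(
    audio=Clip([[0.5], [0.6]], [{"Dog"}, {"Dog"}]),
    visual=Clip([[0.7], [0.8]], [set(), {"Cat"}]),
)
mixed = cross_modal_mix([s1, s2], random.Random(0))
```

Because the segment-level labels travel with their modality, the mixed sample stays consistently labeled without further annotation. This only holds if the unimodal pseudo-labels are reliable to begin with.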
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
