MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

Langyu Wang; Bingke Zhu; Yingying Chen; Yiyuan Zhang; Ming Tang; Jinqiao Wang

arXiv:2507.01384 · cs.CV · August 13, 2025

Open Access

TL;DR

This paper introduces MUG, an audio-visual Mamba network that uses pseudo-labeling and cross-modal augmentation to improve both segment-level and event-level predictions in weakly-supervised audio-visual video parsing, achieving state-of-the-art results on the LLP dataset.

Contribution

The paper proposes a new Mamba network with pseudo-labeling and cross-modal augmentation to better handle segment-level and event-level predictions in audio-visual video parsing.

Findings

01

Improves on the state of the art on the LLP dataset across all metrics.

02

Enhances segment-level and event-level prediction accuracy.

03

Demonstrates effectiveness of pseudo-labeling and augmentation techniques.

Abstract

Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, existing methods struggle to improve segment-level and event-level prediction simultaneously, owing to the limitations of weak supervision and deficiencies in model architecture. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) that emphasises the uniqueness of each segment and excludes noise interference from the other modality. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which enhances the model's ability to parse various segment-level event combinations. For feature processing and interaction, we…
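The cross-modal random combination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sample layout (the dictionary keys, per-segment label sets) and the function names `cross_modal_mix` and `augment` are hypothetical, chosen only to show the idea of pairing the audio stream of one video with the visual stream of another while carrying each modality's segment-level pseudo-labels along.

```python
import random


def cross_modal_mix(sample_a, sample_b):
    """Pair the audio stream of sample_a with the visual stream of sample_b.

    Hypothetical sample layout (an assumption, not taken from the paper):
      "audio_feat", "visual_feat":     per-segment features, length T
      "audio_labels", "visual_labels": per-segment sets of event names, length T
    """
    if len(sample_a["audio_feat"]) != len(sample_b["visual_feat"]):
        raise ValueError("both samples must have the same number of segments")

    audio_labels = sample_a["audio_labels"]
    visual_labels = sample_b["visual_labels"]

    # The video-level label of the synthetic sample is the union of every
    # event that occurs in either modality at any segment.
    video_label = set().union(*audio_labels, *visual_labels)

    return {
        "audio_feat": sample_a["audio_feat"],
        "visual_feat": sample_b["visual_feat"],
        "audio_labels": audio_labels,
        "visual_labels": visual_labels,
        "video_label": video_label,
    }


def augment(dataset, n_new, seed=0):
    """Create n_new synthetic samples from random pairs of distinct videos.

    Requires at least two samples in `dataset`.
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        a, b = rng.sample(dataset, 2)  # two different videos
        new_samples.append(cross_modal_mix(a, b))
    return new_samples
```

Because the labels are unimodal, the synthetic sample keeps an exact segment-level label for each modality even though the audio and visual content never co-occurred in a real video. How the paper samples pairs, and whether it applies any filtering, is not stated in the truncated abstract, so the uniform random pairing here is an assumption.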


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing