EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li; Xiaomeng Di; Ying Xing; Yonghao Dang; Yiming Wang; Jianqin Yin

arXiv:2605.08723·cs.CV·May 12, 2026

TL;DR

This paper introduces a framework that enhances uni-modal representations to improve weakly supervised audio-visual video parsing, addressing a limitation of existing strategies that emphasize multi-modal fusion at the expense of uni-modal semantics.

Contribution

It proposes similarity-based label migration and a soft-constrained approach to better preserve uni-modal semantics during video parsing.

Findings

01 Outperforms state-of-the-art methods in pseudo-label accuracy.

02 Achieves superior localization of audio, visual, and audio-visual events.

03 Enhances the understanding of uni-modal events for better video parsing.

Abstract

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Given this challenging setting, existing research has advanced along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal-focused strategies overemphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the…
