Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Yingying Fan, Yu Wu, Bo Du, Yutian Lin

TL;DR
This paper tackles weakly-supervised audio-visual video parsing (AVVP) from a language perspective: language prompts and dynamic re-weighting are used to address segment-level label noise, and the authors report significant performance improvements.
Contribution
It proposes a language prompt-based method to handle segment-level label noise and introduces dynamic re-weighting to improve weakly-supervised AVVP performance.
Findings
Reported to outperform state-of-the-art methods by a significant margin
Effective handling of segment-level label noise
Improved accuracy in event localization
Abstract
We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events in the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar…
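The prompt-matching step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the prompt template, the helper names (`enumerate_event_cases`, `case_to_prompt`, `assign_segment_labels`), the use of cosine similarity, and the toy embeddings standing in for real text and segment encoders. Because the abstract is truncated, the final step (taking the events of the most similar prompt as the segment-level label) is also an assumption.

```python
from itertools import combinations

import numpy as np


def enumerate_event_cases(video_events):
    """List every combination of the video-level events, including 'no event'."""
    cases = []
    for k in range(len(video_events) + 1):
        cases.extend(combinations(video_events, k))
    return cases


def case_to_prompt(case):
    """Turn one event combination into a text prompt (hypothetical template)."""
    if not case:
        return "A segment with no event."
    return "A segment with " + " and ".join(case) + "."


def assign_segment_labels(segment_emb, prompt_emb, cases):
    """Pick, for each segment, the event combination of its most similar prompt.

    segment_emb: (num_segments, dim) array of segment features.
    prompt_emb:  (num_cases, dim) array of prompt features, one row per case.
    Rows are assumed to be non-zero so that normalization is well defined.
    """
    seg = segment_emb / np.linalg.norm(segment_emb, axis=1, keepdims=True)
    pro = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    similarity = seg @ pro.T  # cosine similarity, shape (num_segments, num_cases)
    best = similarity.argmax(axis=1)
    return [cases[i] for i in best], similarity


# Toy example: a video weakly labeled with two events.
cases = enumerate_event_cases(["speech", "dog"])
prompts = [case_to_prompt(c) for c in cases]

# Stand-in embeddings; a real system would obtain these from pretrained encoders.
prompt_emb = np.array([[1.0, 0.0, 0.0],   # no event
                       [0.0, 1.0, 0.0],   # speech
                       [0.0, 0.0, 1.0],   # dog
                       [0.0, 1.0, 1.0]])  # speech and dog
segment_emb = np.array([[0.1, 0.9, 0.0],
                        [0.0, 1.0, 1.0],
                        [1.0, 0.1, 0.0]])

labels, similarity = assign_segment_labels(segment_emb, prompt_emb, cases)
for t, label in enumerate(labels):
    print(f"segment {t}: {label}")
```

Note that the number of prompts grows exponentially with the number of video-level events, so a sketch like this is only practical when each video carries a handful of labels.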
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media
