Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Yingying Fan; Yu Wu; Bo Du; Yutian Lin

arXiv:2306.00595 · cs.CV · October 31, 2023 · 2 cites

TL;DR

This paper approaches weakly-supervised audio-visual video parsing (AVVP) from the language perspective: it uses language prompts and dynamic re-weighting to address segment-level label noise, and reports outperforming state-of-the-art methods.

Contribution

It proposes a language prompt-based method to handle segment-level label noise and introduces dynamic re-weighting to improve weakly-supervised AVVP performance.

Findings

01. Outperforms state-of-the-art methods significantly

02. Effective handling of segment-level label noise

03. Improved accuracy in event localization

Abstract

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities, but overlook the segment-level label noise, where adjacent video segments (i.e., 1-second video clips) may contain different events. However, recognizing events in the segment is challenging because its label could be any combination of events that occur in the video. To address this issue, we consider tackling AVVP from the language perspective, since language could freely describe how various events appear in each segment beyond fixed labels. Specifically, we design language prompts to describe all cases of event appearance for each video. Then, the similarity between language prompts and segments is calculated, where the event of the most similar…
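
The prompt-matching idea in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it says how the most similar prompt is used, so this sketch assumes the events named in that prompt become the segment's label. The prompt wording, the function names (`build_prompts`, `cosine`, `label_segments`), and the hand-written toy embeddings are all hypothetical; in practice the embeddings would come from pretrained audio, visual, and text encoders, which this listing does not specify.

```python
from itertools import combinations
import math


def build_prompts(video_events):
    """Enumerate every subset of the video-level events as one prompt.

    Returns a list of (event_tuple, prompt_text) pairs. The prompt wording
    is a hypothetical template, not taken from the paper.
    """
    prompts = []
    for k in range(len(video_events) + 1):
        for combo in combinations(video_events, k):
            if combo:
                text = "This segment contains " + " and ".join(combo) + "."
            else:
                text = "This segment contains no event."
            prompts.append((combo, text))
    return prompts


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_segments(segment_embs, prompt_embs, prompts):
    """Assign each segment the events of its most similar prompt (assumed rule)."""
    labels = []
    for seg in segment_embs:
        sims = [cosine(seg, p) for p in prompt_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(prompts[best][0])
    return labels


# Toy example: a video weakly labeled with two events.
prompts = build_prompts(["speech", "dog"])

# Hand-written stand-ins for encoder outputs; axes are (speech, dog, no event).
prompt_embs = [
    [0.0, 0.0, 1.0],  # no event
    [1.0, 0.0, 0.0],  # speech
    [0.0, 1.0, 0.0],  # dog
    [1.0, 1.0, 0.0],  # speech and dog
]
segment_embs = [
    [0.9, 0.1, 0.0],   # mostly speech
    [0.8, 0.7, 0.0],   # speech and dog together
    [0.0, 0.05, 0.9],  # nothing recognizable
]

segment_labels = label_segments(segment_embs, prompt_embs, prompts)
```

Enumerating all subsets grows as 2^n in the number of video-level events, which stays small when a video carries only a few labels. Whether the paper enumerates prompts exactly this way is not stated in the abstract shown here.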

Videos

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Subtitles and Audiovisual Media