Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Haoyue Cheng; Zhaoyang Liu; Hang Zhou; Chen Qian; Wayne Wu; Limin Wang

arXiv:2204.11573 · cs.CV · August 2, 2022

TL;DR

This paper introduces a dynamic label denoising strategy for weakly-supervised audio-visual video parsing, effectively identifying and removing modality-specific noisy labels to improve event recognition and localization accuracy.

Contribution

The work proposes a training strategy that dynamically identifies and removes modality-specific noisy labels based on the relationships between intra-modal and inter-modal losses, improving weakly-supervised audio-visual video parsing performance.

Findings

01. Significant performance improvements over previous methods.

02. Effective noise ratio estimation method.

03. Validated on benchmark datasets with publicly available code.

Abstract

This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries. This task is challenging because only overall labels indicating the video events are provided for training. However, an event might be labeled but not appear in one of the modalities, which results in a modality-specific noisy label problem. In this work, we propose a training strategy to identify and remove modality-specific noisy labels dynamically. It is motivated by two key observations: 1) networks tend to learn clean samples first; and 2) a labeled event would appear in at least one modality. Specifically, we sort the losses of all instances within a mini-batch individually in each modality, and then select noisy samples according to the relationships between intra-modal and inter-modal losses. Besides,…
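
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact rule, so everything below is an assumption made for illustration: the function name `select_noisy_labels`, the per-modality noise ratios passed in as plain numbers, and the specific criterion (a label is flagged in a modality when its loss is among the largest in that modality and also exceeds the loss of the same instance in the other modality). It is not the authors' implementation.

```python
from typing import List, Tuple


def select_noisy_labels(
    audio_losses: List[float],
    visual_losses: List[float],
    audio_noise_ratio: float,
    visual_noise_ratio: float,
) -> Tuple[List[bool], List[bool]]:
    """Flag likely modality-specific noisy labels within one mini-batch.

    Hypothetical rule (an assumption, not taken from the paper):
      1. Intra-modal: sort losses within each modality and take the
         top `ratio` fraction as noisy candidates.
      2. Inter-modal: keep a candidate only if its loss in this modality
         is larger than its loss in the other modality.
    """
    n = len(audio_losses)
    if n != len(visual_losses):
        raise ValueError("both modalities must have one loss per instance")

    def top_fraction(losses: List[float], ratio: float) -> set:
        # Indices of the floor(ratio * n) largest losses in this modality.
        k = int(ratio * n)
        order = sorted(range(n), key=lambda i: losses[i], reverse=True)
        return set(order[:k])

    audio_candidates = top_fraction(audio_losses, audio_noise_ratio)
    visual_candidates = top_fraction(visual_losses, visual_noise_ratio)

    # The strict inequalities are mutually exclusive, so no instance is
    # flagged in both modalities: the label survives in at least one.
    audio_noisy = [
        i in audio_candidates and audio_losses[i] > visual_losses[i]
        for i in range(n)
    ]
    visual_noisy = [
        i in visual_candidates and visual_losses[i] > audio_losses[i]
        for i in range(n)
    ]
    return audio_noisy, visual_noisy


# Toy losses for one event label across a batch of four videos.
audio_flags, visual_flags = select_noisy_labels(
    audio_losses=[0.1, 2.0, 0.3, 1.5],
    visual_losses=[1.8, 0.2, 0.4, 1.6],
    audio_noise_ratio=0.5,
    visual_noise_ratio=0.5,
)
print(audio_flags)   # [False, True, False, False]
print(visual_flags)  # [True, False, False, True]
```

The strict comparison between the two modalities is one simple way to respect the second observation in the abstract, that a labeled event appears in at least one modality: a label can be dropped from the audio side or the visual side, but never from both.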


Taxonomy

Topics: Speech and Audio Processing · Video Analysis and Summarization · Cancer-related molecular mechanisms research