UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Yung-Hsuan Lai, Janek Ebbers, Yu-Chiang Frank Wang, François Germain, Michael Jeffrey Jones, Moitreya Chatterjee

TL;DR
This paper introduces UWAV, a novel weakly-supervised audio-visual video parsing method that models uncertainty in pseudo-labels and uses feature mixup, significantly improving performance over existing approaches.
Contribution
The paper presents UWAV, a new approach that incorporates uncertainty estimation and feature mixup regularization to enhance weakly-supervised AVVP performance.
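The two ingredients named above can be illustrated with a minimal, framework-free Python sketch. It is not UWAV's implementation. This page does not say how uncertainty is estimated or how the loss is weighted, so the confidence score used here (one minus the binary entropy of a soft pseudo-label) is an assumption. The helper names bce, confidence, uncertainty_weighted_loss and feature_mixup are also hypothetical. The mixup part is standard mixup, with lambda drawn from Beta(alpha, alpha).

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability p and a (soft) target y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def confidence(q: float) -> float:
    """Hypothetical confidence of a soft pseudo-label q in [0, 1]:
    1 - binary entropy in bits, i.e. 1.0 at q in {0, 1} and 0.0 at q = 0.5."""
    if q <= 0.0 or q >= 1.0:
        return 1.0
    entropy = -(q * math.log2(q) + (1.0 - q) * math.log2(1.0 - q))
    return 1.0 - entropy


def uncertainty_weighted_loss(preds: Sequence[float], pseudo: Sequence[float]) -> float:
    """Confidence-weighted mean BCE over flattened (segment, class) entries.
    Uncertain pseudo-labels contribute less to the loss (assumed weighting scheme)."""
    num, den = 0.0, 0.0
    for p, q in zip(preds, pseudo):
        w = confidence(q)
        num += w * bce(p, q)
        den += w
    return num / den if den > 0.0 else 0.0


def feature_mixup(
    x1: Sequence[float], y1: Sequence[float],
    x2: Sequence[float], y2: Sequence[float],
    alpha: float = 0.2, rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Standard mixup: the same convex combination is applied to two feature
    vectors and to their targets, with lambda ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


preds = [0.9, 0.2, 0.6]
pseudo = [1.0, 0.0, 0.5]  # the third pseudo-label is maximally uncertain
loss = uncertainty_weighted_loss(preds, pseudo)
```

In this example the third term gets zero weight, so the loss reduces to the mean BCE of the first two entries. A smooth weighting like this is only one option; a hard confidence threshold on the pseudo-labels would be another.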
Findings
UWAV outperforms state-of-the-art methods on multiple metrics.
The approach demonstrates strong generalizability across datasets.
Incorporating uncertainty improves pseudo-label quality.
Abstract
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit…
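To make the task definition concrete, the sketch below encodes per-segment, per-class event presence for each modality as 0/1 grids. From these it derives the audio-visual, audio-only and visual-only events as the abstract defines them. It also derives the modality-agnostic, video-level label, which is the only supervision available in the weakly-supervised setting. The max-pooling in pooled_video_prob is a hypothetical stand-in for whatever segment-to-video aggregation a given AVVP model uses, and all function names are illustrative.

```python
from typing import List

Grid = List[List[int]]     # [num_segments][num_classes], 1 = event present
Probs = List[List[float]]  # [num_segments][num_classes], predicted probabilities


def audio_visual(audio: Grid, visual: Grid) -> Grid:
    """Multi-modal events: present in both modalities in the same segment."""
    return [[a & v for a, v in zip(ra, rv)] for ra, rv in zip(audio, visual)]


def audio_only(audio: Grid, visual: Grid) -> Grid:
    """Uni-modal events occurring exclusively in the audio stream."""
    return [[a & (1 - v) for a, v in zip(ra, rv)] for ra, rv in zip(audio, visual)]


def visual_only(audio: Grid, visual: Grid) -> Grid:
    """Uni-modal events occurring exclusively in the visual stream."""
    return [[v & (1 - a) for a, v in zip(ra, rv)] for ra, rv in zip(audio, visual)]


def video_level_label(audio: Grid, visual: Grid) -> List[int]:
    """Weak label: class c is positive if it occurs in any segment of either modality."""
    num_classes = len(audio[0])
    return [
        int(any(seg[c] for seg in audio) or any(seg[c] for seg in visual))
        for c in range(num_classes)
    ]


def pooled_video_prob(audio_probs: Probs, visual_probs: Probs) -> List[float]:
    """Hypothetical aggregation: max over segments and modalities, giving a
    video-level prediction that can be trained against video_level_label."""
    num_classes = len(audio_probs[0])
    return [
        max(max(seg[c] for seg in audio_probs), max(seg[c] for seg in visual_probs))
        for c in range(num_classes)
    ]
```

At training time only the output of video_level_label is observed. Pseudo-labeling approaches try to recover segment-level targets that resemble the per-modality grids.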
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Image and Signal Denoising Methods
Methods: Mixup
