CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

TL;DR
This paper introduces CoLeaF, a novel framework for weakly supervised audio-visual video parsing that improves detection of aligned (audible-visible) events while filtering out irrelevant modality information for unaligned events, achieving new state-of-the-art results.
Contribution
CoLeaF explicitly learns to combine cross-modal information for aligned events and filters out unaligned events, improving weakly supervised AVVP without extra inference costs.
Findings
Achieves 1.9% and 2.4% higher F-score than prior state-of-the-art methods on the LLP and UnAV-100 datasets, respectively.
Effectively filters irrelevant modality information in weakly supervised settings.
Leverages cross-class relationships during training without additional inference costs.
Abstract
Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces…
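To make the task definition in the abstract concrete, the following is a minimal sketch of how the three event types relate to per-segment audio and visual annotations, and of what a video-level (weak) label discards. It is an illustration of the problem setup only, not of CoLeaF itself; the function names, the segment representation as Python sets, and the example events are all assumptions made for this sketch.

```python
from typing import Dict, List, Set


def parse_segments(
    audio: List[Set[str]], visual: List[Set[str]]
) -> List[Dict[str, Set[str]]]:
    """Split per-segment event sets into the three AVVP event types.

    audio[i] / visual[i] hold the event classes heard / seen in segment i
    (a hypothetical representation chosen for this sketch).
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same segments")
    parsed = []
    for a, v in zip(audio, visual):
        parsed.append(
            {
                "audible_only": a - v,      # heard but not seen (unaligned)
                "visible_only": v - a,      # seen but not heard (unaligned)
                "audible_visible": a & v,   # heard and seen (aligned)
            }
        )
    return parsed


def video_level_label(audio: List[Set[str]], visual: List[Set[str]]) -> Set[str]:
    """Weak label: every event occurring anywhere in the video,
    with modality and timing information discarded."""
    label: Set[str] = set()
    for a, v in zip(audio, visual):
        label |= a | v
    return label


# Hypothetical three-segment video.
audio = [{"speech"}, {"speech", "dog"}, set()]
visual = [set(), {"dog"}, {"car"}]

parsed = parse_segments(audio, visual)
weak = video_level_label(audio, visual)
```

In this toy example the weak label only says that "speech", "dog", and "car" occur somewhere in the video; a weakly supervised parser has to recover which modality and which segments each event belongs to from that signal alone.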
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Video Analysis and Summarization
