CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Faegheh Sardari; Armin Mustafa; Philip J. B. Jackson; Adrian Hilton

arXiv:2405.10690 · cs.CV · July 16, 2024 · 1 citation

TL;DR

This paper introduces CoLeaF, a framework for weakly supervised audio-visual video parsing that combines cross-modal information for aligned (audible-visible) events and filters it out for unaligned ones, improving on prior state-of-the-art results on the LLP and UnAV-100 datasets.

Contribution

CoLeaF explicitly learns to combine cross-modal information for aligned events and to filter it out for unaligned events, improving weakly supervised AVVP without extra inference cost.

Findings

1. Achieves F-scores 1.9% and 2.4% higher than the prior state of the art on the LLP and UnAV-100 datasets, respectively.
2. Filters out irrelevant modality information for unaligned events in the weakly supervised setting.
3. Exploits cross-class relationships during training without adding inference cost.

Abstract

Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces…
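To make the task definition in the abstract concrete, the following minimal sketch shows how per-segment audio and visual event sets map onto the three AVVP output categories, and how they collapse into the single video-level label that is the only supervision in the weak setting. It illustrates the task only, not the CoLeaF method; the function names `parse_segments` and `video_level_label` and the example events are hypothetical.

```python
from typing import Dict, List, Set


def parse_segments(
    audio: List[Set[str]], visual: List[Set[str]]
) -> List[Dict[str, Set[str]]]:
    """Split per-segment event sets into the three AVVP categories.

    audio[i] / visual[i] hold the event classes heard / seen in segment i.
    (Hypothetical helper for illustration; not part of the CoLeaF code.)
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same segments")
    parsed = []
    for a, v in zip(audio, visual):
        parsed.append(
            {
                "audible_only": a - v,      # heard but not seen (unaligned)
                "visible_only": v - a,      # seen but not heard (unaligned)
                "audible_visible": a & v,   # heard and seen (aligned)
            }
        )
    return parsed


def video_level_label(audio: List[Set[str]], visual: List[Set[str]]) -> Set[str]:
    """Union of all events in the video: the weak, video-level label."""
    label: Set[str] = set()
    for a, v in zip(audio, visual):
        label |= a | v
    return label


# Toy three-segment video (made-up events).
audio = [{"speech"}, {"speech", "dog"}, set()]
visual = [{"dog"}, {"dog"}, {"car"}]
parsed = parse_segments(audio, visual)
weak_label = video_level_label(audio, visual)
```

In this toy video the weak label contains `speech`, `dog`, and `car`, but carries no information about which modality or segment each event belongs to; recovering that structure is what a weakly supervised AVVP model has to learn.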

Code & Models

Repositories

faeghehsardari/coleaf
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Video Analysis and Summarization