Text-to-feature diffusion for audio-visual few-shot learning

Otniel-Bogdan Mercea; Thomas Hummel; A. Sophia Koepke; Zeynep Akata

arXiv:2309.03869 · cs.CV · September 8, 2023

TL;DR

This paper introduces a new benchmark and a diffusion-based framework for audio-visual few-shot video classification, demonstrating state-of-the-art results with limited labeled data.

Contribution

It presents a unified benchmark for audio-visual few-shot learning and proposes AV-DIFF, a novel text-to-feature diffusion method for multi-modal feature generation.

Findings

01. AV-DIFF achieves state-of-the-art performance on the benchmark.

02. The benchmark enables effective evaluation of audio-visual few-shot learning methods.

03. The framework leverages cross-modal attention for feature fusion.

Abstract

Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e., the VGGSound-FSL, UCF-FSL, and ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains…
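
The cross-modal attention fusion mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (the official repository uses PyTorch): it assumes single-head scaled dot-product attention with no learned projections, and the feature values and the names `cross_attention`, `audio`, and `visual` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where each query attends over all keys.

    Assumption: queries come from one modality and keys/values from the
    other. A real model would first apply learned linear projections.
    """
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 2 audio time steps, 3 visual time steps, dim 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each modality queries the other, giving one fused vector per query step.
audio_attends_visual = cross_attention(audio, visual, visual)
visual_attends_audio = cross_attention(visual, audio, audio)
```

Each output row is a convex combination of the other modality's feature vectors, so the fused sequence keeps the temporal length of the querying modality.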

Code & Models

Repositories

explainableml/avdiff-gfsl (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Diffusion