Text-to-feature diffusion for audio-visual few-shot learning
Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

TL;DR
This paper introduces a new benchmark and a diffusion-based framework for audio-visual few-shot video classification, demonstrating state-of-the-art results with limited labeled data.
Contribution
It presents a unified benchmark for audio-visual few-shot learning and proposes AV-DIFF, a novel text-to-feature diffusion method for multi-modal feature generation.
Findings
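AV-DIFF achieves state-of-the-art performance on the proposed audio-visual few-shot benchmark.
The benchmark spans three datasets (VGGSound-FSL, UCF-FSL, and ActivityNet-FSL), on which ten methods are adapted and compared.
The framework first fuses temporal and audio-visual features via cross-modal attention, then generates multi-modal features for the novel classes.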
Abstract
Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e., the VGGSound-FSL, UCF-FSL, and ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains…
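The abstract mentions fusing audio and visual features via cross-modal attention. As a rough illustration of that general idea only, the sketch below lets audio features attend over visual features using standard scaled dot-product attention in NumPy. It is not the AV-DIFF architecture: the function names, the feature dimensions, and the random projection matrices are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product attention where one modality queries the other.

    queries: (T_q, d_q) features of the attending modality (here: audio).
    context: (T_c, d_c) features of the attended modality (here: visual).
    w_q, w_k, w_v: hypothetical projection matrices into a shared d_model space.
    """
    q = queries @ w_q                            # (T_q, d_model)
    k = context @ w_k                            # (T_c, d_model)
    v = context @ w_v                            # (T_c, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T_q, T_c)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

# Toy inputs: 4 time steps, made-up feature sizes (assumptions, not from the paper).
rng = np.random.default_rng(0)
T, d_audio, d_visual, d_model = 4, 8, 16, 8
audio = rng.normal(size=(T, d_audio))
visual = rng.normal(size=(T, d_visual))
w_q = rng.normal(size=(d_audio, d_model))
w_k = rng.normal(size=(d_visual, d_model))
w_v = rng.normal(size=(d_visual, d_model))

fused, weights = cross_modal_attention(audio, visual, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a trained model the projection matrices would be learned, and attention would typically run in both directions (audio to visual and visual to audio); the random weights here only demonstrate the shapes and the data flow.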
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
Methods: Diffusion
