Who calls the shots? Rethinking Few-Shot Learning for Audio
Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, Juan Pablo Bello

TL;DR
This paper investigates the challenges specific to few-shot learning for audio recognition. It argues for application-specific approaches, because audio is often multi-label and has properties such as polyphony and signal-to-noise ratio (SNR) that affect performance.
Contribution
It introduces two new datasets for audio few-shot learning and provides audio-specific insights that challenge assumptions from image-based few-shot learning research.
Findings
No single best model or method for all audio scenarios
Support set selection depends on application context
Audio properties significantly influence few-shot learning performance
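To make one of these audio properties concrete, here is a minimal sketch of mixing a foreground sound into a background at a chosen signal-to-noise ratio (SNR). It is not the authors' dataset-generation code. The function names `rms` and `mix_at_snr`, the toy tone and noise signals, and the RMS-based definition of SNR are all assumptions made for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(foreground, background, snr_db):
    """Scale `foreground` so that its RMS level relative to `background`
    equals `snr_db` (in dB), then add the two signals sample by sample.

    Hypothetical helper: assumes an RMS-based SNR, which is only one of
    several possible level measures.
    """
    if len(foreground) != len(background):
        raise ValueError("signals must have the same length")
    fg_rms, bg_rms = rms(foreground), rms(background)
    if fg_rms == 0 or bg_rms == 0:
        raise ValueError("signals must be non-silent")
    # SNR in dB is 20 * log10(fg_rms / bg_rms); solve for the foreground gain.
    gain = (bg_rms / fg_rms) * 10 ** (snr_db / 20)
    scaled = [gain * x for x in foreground]
    mixture = [f + b for f, b in zip(scaled, background)]
    return mixture, scaled


# Toy signals: a 440 Hz tone as the target sound, Gaussian noise as background.
random.seed(0)
sr = 8000  # sample rate in Hz, one second of audio
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
noise = [random.gauss(0.0, 0.1) for _ in range(sr)]

mixture, scaled_tone = mix_at_snr(tone, noise, snr_db=-5.0)
measured_snr = 20 * math.log10(rms(scaled_tone) / rms(noise))
print(f"measured SNR: {measured_snr:.2f} dB")
```

Lower `snr_db` values bury the target sound deeper in the background, which is the kind of controlled variation that makes it possible to study how SNR affects a few-shot model.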
Abstract
Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate the answers to these questions. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
