Improving Audio Classification by Transitioning from Zero- to Few-Shot
James Taylor, Wolfgang Mack

TL;DR
This paper explores moving from zero-shot to few-shot learning in audio classification. It shows that few-shot methods typically improve accuracy over the zero-shot baseline by replacing noisy text embeddings with class representations built from grouped audio embeddings.
Contribution
The paper examines few-shot classification methods that improve audio classification accuracy by grouping audio embeddings by class and processing them to replace the inherently noisy text embeddings.
Findings
Few-shot classification typically outperforms the zero-shot baseline.
Grouping audio embeddings reduces noise in class representations.
Refined embeddings lead to improved classification accuracy.
Abstract
State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.
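The contrast between the two approaches described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the grouped audio embeddings are processed, so the sketch assumes the simplest option, averaging each class's embeddings into a single prototype. The function names (`zero_shot_predict`, `class_prototypes`, `few_shot_predict`) and the toy 2-D embeddings are hypothetical; in practice the embeddings would come from a contrastively trained audio-text model.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_predict(audio_emb, text_emb):
    """Zero-shot: assign each clip to the class whose text embedding is most similar."""
    sims = l2_normalize(audio_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)


def class_prototypes(support_emb, support_labels, num_classes):
    """Few-shot: group labelled audio embeddings by class and average each group.

    Averaging is an assumption made for this sketch; the paper may process
    the grouped embeddings differently.
    """
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(protos)


def few_shot_predict(audio_emb, prototypes):
    """Assign each clip to the class whose audio prototype is most similar."""
    sims = l2_normalize(audio_emb) @ prototypes.T
    return sims.argmax(axis=1)


# Toy data (hypothetical): two classes, two labelled support clips per class.
support_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])   # stand-in text embeddings
query_emb = np.array([[1.0, 0.2], [0.1, 1.0]])  # clips to classify

prototypes = class_prototypes(support_emb, support_labels, num_classes=2)
zero_shot_labels = zero_shot_predict(query_emb, text_emb)
few_shot_labels = few_shot_predict(query_emb, prototypes)
```

The only difference between the two predictors is the source of the class representation: a text embedding in the zero-shot case, a prototype built from a handful of labelled audio clips in the few-shot case. The similarity computation is the same in both.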
