VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

TL;DR
VocSim introduces a training-free benchmark for evaluating zero-shot content identity in single-source audio, revealing strengths and gaps in current audio embeddings across diverse sound types.
Contribution
The paper presents VocSim, a novel training-free benchmark that assesses the intrinsic geometric alignment of frozen audio embeddings for zero-shot content identification.
Findings
Strong zero-shot performance with simple pipeline
Uncovered a generalization gap in unseen phonotactics
Embeddings predict perceptual similarity and improve bioacoustic classification
Abstract
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline, frozen Whisper encoder…
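As a rough illustration of the local-purity metric named in the abstract, the sketch below computes Precision@k over a handful of toy embeddings: for each clip, the fraction of its k nearest neighbours that share its label, averaged over all clips. It also shows a label-permutation baseline. The abstract describes that calibration for GSR only, so applying it to Precision@k here is an assumption made for illustration, as are the cosine similarity, the toy vectors, the function names, and the choice to express lift as a difference. None of this is taken from the authors' code.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def precision_at_k(embeddings, labels, k):
    """Mean fraction of each item's k nearest neighbours (self excluded)
    that carry the same label as the item."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Rank every other item by similarity to item i, most similar first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_similarity(embeddings[i], embeddings[j]),
            reverse=True,
        )
        hits = sum(1 for j in ranked[:k] if labels[j] == labels[i])
        total += hits / k
    return total / n


def permutation_baseline(embeddings, labels, k, n_permutations=100, seed=0):
    """Average Precision@k after shuffling labels (assumed calibration scheme)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        scores.append(precision_at_k(embeddings, shuffled, k))
    return sum(scores) / len(scores)


# Toy data: two well-separated classes of three points each (hypothetical).
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
]
labels = ["a", "a", "a", "b", "b", "b"]

score = precision_at_k(embeddings, labels, k=2)
baseline = permutation_baseline(embeddings, labels, k=2)
lift = score - baseline  # assumption: lift as a difference, not a ratio

print(f"P@2 = {score:.3f}, permutation baseline = {baseline:.3f}, lift = {lift:.3f}")
```

Because no parameters are fitted, the score depends only on the geometry of the frozen embeddings, which is the property the benchmark sets out to probe.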
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper contains two strengths: First, the motivation is clear as the benchmark focuses specifically on content identity under zero-shot settings in single-source audio, distinguishing it from scene analysis or supervised classification. The data processing is decent. The curated audio is segmented by type, and includes clean vs. noisy domains, different durations, and varied class granularities. This design encourages generalization testing across different acoustic conditions. Second, the
However, there are two major weaknesses that hinder the paper's acceptance. First, the novelty is limited. The benchmark is mostly an aggregation of previous datasets (19 existing corpora) rather than a newly collected dataset with fresh labeling, annotations, or a clearly designed structure. Similar aggregation-based benchmark efforts in audio already exist, as the authors also mention (e.g., HEAR, SUPERB). Other works have also pursued large aggregated datasets for general-purpose audio simi
* The proposed method evaluates the **intrinsic representation capability** of audio encoders **without any fine-tuning or additional learnable parameters**. In contrast, benchmarks such as **SUPERB** rely on feature aggregation and trainable parameters, making their results sensitive to these design choices.
* A key strength of this approach is its use of **zero-shot evaluation**, which provides a more direct and unbiased measure of generalization performance.
* The benchmark encompasses a **di
* **Limited task scope:** The benchmark focuses exclusively on **classification-oriented tasks**, which restricts its applicability. Audio encoders and their learned representations are widely used in other important areas, such as **text-to-speech (TTS)**, **speaker diarization**, **speech-to-speech (S2S, especially dialog) systems**, and **speech enhancement or separation**. As a result, the current setup provides only a partial view of encoder performance.
* **Incomplete domain coverage:** Th
- The open-sourced data and code base are of high quality, which allows for easy reproduction and further research in the area.
- It addresses a key need: better measurement of the acoustic latent space structure.
Unstructured notes:
- I think you are missing some authors here: https://arxiv.org/abs/2203.03022 (add et al.)
- In the related work, I think you should mention contrastive learning methods: BYOL/DINO/Barlow Twins, e.g. https://arxiv.org/pdf/2209.14345
- I am skeptical about the quality of GSR as a useful metric. I agree with your points in line 397. An additional point might be that, as it is so sensitive to outliers, is the GSR not more of a measure of the mislabelling rate? A label error will upper bound
- The paper introduces a large-scale dataset for zero-shot audio processing. A key strength is its size and straightforward accessibility via Hugging Face, lowering the barrier to entry for researchers. This increases the potential for reproducibility and follow-up work.
- The planned introduction of a public leaderboard might provide a clear framework for standardized evaluation, which might foster further research.
- The authors provide a comprehensive benchmark for zero-shot audio tasks.
**1. Critical flaws in references**
The paper's credibility is strongly undermined by what appear to be numerous hallucinated or incorrect references. This raises concerns about the validation process behind the related work section and, by extension, the rest of the paper (all datasets, results, etc.). A manual check of the bibliography revealed several major errors, including (but not limited to):
- L. 518-521 (Birb): The cited paper does not exist. The actual BIRB benchmark was published by
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Animal Vocal Communication and Behavior
