On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
Tiago Tavares, Fabio Ayres, Zhepei Wang, Paris Smaragdis

TL;DR
This paper investigates how audio-text contrastive learning for zero-shot tasks can be biased by pre-trained backbones and data leakage, revealing that much of the apparent accuracy stems from unimodal strengths rather than true cross-modal transfer.
Contribution
It uncovers the impact of backbone biases and data leakage on zero-shot audio-text contrastive learning performance, highlighting pitfalls in current evaluation methods.
Findings
A significant part of the measured zero-shot accuracy comes from strengths inherited from the audio and text backbones, not from cross-modal learning.
Unintentional data leakage inflates performance metrics.
Backbone pre-training heavily influences zero-shot results.
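One way to make the leakage finding concrete is a simple check for evaluation class names that already appear in the text used to train the cross-modal projection. The sketch below is illustrative only and is not the procedure used in the paper: the function names `tokenize` and `leaked_labels`, the bag-of-words matching rule, and the toy captions are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaked_labels(train_captions, eval_labels):
    """Count, for each evaluation label, the training captions that contain all of its words.

    Hypothetical helper: a label counts as "leaked" when every word of the
    label occurs in at least one training caption (word order is ignored).
    """
    caption_tokens = [set(tokenize(c)) for c in train_captions]
    leaked = {}
    for label in eval_labels:
        label_tokens = set(tokenize(label))
        hits = sum(1 for tokens in caption_tokens if label_tokens <= tokens)
        if hits:
            leaked[label] = hits
    return leaked


# Toy data, invented for illustration.
captions = [
    "A dog barking in the street",
    "Rain falling on a roof",
    "dog bark and car horn",
]
labels = ["dog", "car horn", "siren"]

overlap = leaked_labels(captions, labels)
print(overlap)
```

A label that shows up in this report was visible during projector training, so accuracy on that class is not strictly zero-shot. Exact word matching is a deliberately crude criterion; synonyms and paraphrases would need a looser test.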
Abstract
Recent advances in audio-text cross-modal contrastive learning have shown its potential towards zero-shot learning. One possibility for this is by projecting item embeddings from pre-trained backbone neural networks into a cross-modal space in which item similarity can be calculated in either domain. This process relies on a strong unimodal pre-training of the backbone networks, and on a data-intensive training task for the projectors. These two processes can be biased by unintentional data leakage, which can arise from using supervised learning in pre-training or from inadvertently training the cross-modal projection using labels from the zero-shot learning evaluation. In this study, we show that a significant part of the measured zero-shot learning accuracy is due to strengths inherited from the audio and text backbones, that is, they are not learned in the cross-modal domain and are…
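The projection-and-similarity setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear projectors `W_audio` and `W_text`, the function names, the embedding sizes, and all numeric values are hypothetical. In practice the backbone embeddings would come from pre-trained audio and text networks and the projectors would be trained with a contrastive loss.

```python
import math


def project(weights, x):
    """Apply a linear projector (list of rows) to a backbone embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine(u, v):
    """Cosine similarity between two vectors in the shared space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(audio_emb, label_embs, W_audio, W_text):
    """Pick the label whose projected text embedding is closest to the projected audio."""
    z_audio = project(W_audio, audio_emb)
    scores = {
        label: cosine(z_audio, project(W_text, emb))
        for label, emb in label_embs.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical projectors: audio backbone dim 3 -> shared dim 2, text dim 2 -> 2.
W_audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_text = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical text-backbone embeddings of the class names.
label_embs = {"dog": [1.0, 0.0], "siren": [0.0, 1.0]}

# Hypothetical audio-backbone embedding of one clip.
audio_emb = [0.9, 0.1, 0.3]

predicted, scores = zero_shot_predict(audio_emb, label_embs, W_audio, W_text)
print(predicted, scores)
```

Because the prediction depends only on where the backbones already place the items, replacing the trained projectors with fixed or random ones is one way to probe how much accuracy is inherited from the backbones.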
Taxonomy
Topics: Geophysical Methods and Applications · Domain Adaptation and Few-Shot Learning · Ideological and Political Education
Methods: Contrastive Learning
