Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification

Yang Wang; Qibin Liang; Chenghao Xiao; Yizhi Li; Noura Al Moubayed; Chenghua Lin

arXiv:2309.11895 · cs.SD · September 23, 2025 · 1 citation

Open Access

TL;DR

This paper proposes a two-stage fine-tuning framework for pre-trained audio models that separates representation learning from classifier training, using contrastive tuning and dual-probe evaluation to improve and analyze embedding quality.

Contribution

The paper introduces a disentangled two-stage fine-tuning approach and a dual-probe evaluation protocol, improving both the performance of audio model representations and the understanding of their quality.

Findings

1. Improved accuracy on diverse audio classification tasks.
2. Superior embedding space quality revealed by dual-probing.
3. Outperforms vanilla fine-tuning and strong baselines on multiple datasets.

Abstract

Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a "contrastive-tuning" stage to explicitly improve the geometric structure of the model's embedding space. Subsequently, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved…
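The dual-probe idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names (`linear_probe`, `knn_probe`, `dual_probe`), the cosine-similarity metric, `k=5`, the softmax-regression probe and its hyperparameters, and the synthetic 2-D "embeddings" that stand in for the outputs of a frozen audio encoder.

```python
import numpy as np


def knn_probe(train_x, train_y, test_x, test_y, k=5):
    """k-NN probe: majority vote among the k most similar training embeddings.

    Cosine similarity is an assumed choice of metric, not taken from the paper.
    """
    a = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    b = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = b @ a.T                                  # (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]       # indices of the k nearest
    nn_labels = train_y[nn_idx]                     # (n_test, k)
    n_classes = int(train_y.max()) + 1
    votes = np.stack(
        [(nn_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    pred = votes.argmax(axis=1)
    return float((pred == test_y).mean())


def linear_probe(train_x, train_y, test_x, test_y, epochs=200, lr=0.5):
    """Linear probe: softmax regression fitted on frozen embeddings.

    Full-batch gradient descent with assumed hyperparameters.
    """
    n_classes = int(train_y.max()) + 1
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8
    xtr = (train_x - mean) / std
    xte = (test_x - mean) / std
    w = np.zeros((xtr.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[train_y]
    for _ in range(epochs):
        logits = xtr @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(xtr)           # d(cross-entropy)/d(logits)
        w -= lr * (xtr.T @ grad)
        b -= lr * grad.sum(axis=0)
    pred = (xte @ w + b).argmax(axis=1)
    return float((pred == test_y).mean())


def dual_probe(train_x, train_y, test_x, test_y, k=5):
    """Report both probe accuracies for the same frozen embeddings."""
    return {
        "linear": linear_probe(train_x, train_y, test_x, test_y),
        "knn": knn_probe(train_x, train_y, test_x, test_y, k=k),
    }


# Synthetic stand-in for encoder embeddings: three well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])


def make_split(n_per_class):
    x = np.concatenate(
        [c + 0.5 * rng.standard_normal((n_per_class, 2)) for c in centers]
    )
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return x, y


train_x, train_y = make_split(50)
test_x, test_y = make_split(20)
scores = dual_probe(train_x, train_y, test_x, test_y)
print(scores)
```

In a real setting, `train_x` and `test_x` would be embeddings extracted once from the frozen encoder after the contrastive-tuning stage. Reading the two scores side by side is the point: the linear score reflects whether classes can be separated by hyperplanes across the whole space, while the k-NN score depends on whether each point's nearest neighbours share its label.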

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis