Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
Jaskirat Sudan, Hashim Ali, Surya Subramani, Hafiz Malik

TL;DR
This paper studies supervised contrastive learning for deepfake audio detection, varying the similarity measure and the scaling of negatives, and finds that cosine similarity with a delayed negative queue gives the best ITW and pooled EER.
Contribution
It provides a controlled study on contrastive learning variations specifically for audio deepfake detection, highlighting the effectiveness of cosine similarity and negative queue strategies.
Findings
Cosine similarity with a delayed queue achieves the best In-the-Wild (ITW) EER of 8.29%.
Angular similarity performs well without queued negatives, reaching an ITW EER of 8.70%.
Angular similarity's strong result without a queue suggests it relies less on large negative sets.
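The findings above are reported as equal error rate (EER), the operating point where the false acceptance and false rejection rates coincide. A minimal sketch of one common way to estimate it follows; the function name `equal_error_rate`, the convention that higher scores mean bona fide, and the threshold sweep are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER by sweeping thresholds over the observed scores.

    Assumed convention (not from the paper): higher score = more likely
    bona fide; labels are 1 for bona fide and 0 for spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona_fide = scores[labels == 1]
    spoof = scores[labels == 0]
    thresholds = np.unique(scores)  # sorted unique candidate thresholds
    # A trial is accepted when its score is >= the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])     # spoof accepted
    frr = np.array([(bona_fide < t).mean() for t in thresholds])  # bona fide rejected
    idx = np.argmin(np.abs(far - frr))  # closest crossing of the two curves
    return (far[idx] + frr[idx]) / 2.0
```

With finite data the two error curves rarely cross exactly, so this sketch averages them at the closest threshold; interpolating between thresholds is another common choice.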
Abstract
Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW…
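The two factors varied in the abstract, the similarity function and extra queued negatives, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the angular form `1 - 2*theta/pi`, the temperature `tau=0.1`, the choice to let queued embeddings enter only the denominator as negatives, and all function and parameter names are hypothetical.

```python
import numpy as np

def similarity(a, b, kind="cosine"):
    """Pairwise similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = np.clip(a @ b.T, -1.0, 1.0)
    if kind == "cosine":
        return cos
    if kind == "angular":
        # Assumed form: rescale the hyperspherical angle theta to [-1, 1].
        return 1.0 - 2.0 * np.arccos(cos) / np.pi
    raise ValueError(f"unknown similarity: {kind}")

def supcon_loss(z, y, kind="cosine", tau=0.1, queue_z=None, queue_y=None):
    """Supervised contrastive loss over a batch, with optional queued negatives."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    n = len(y)
    logits = similarity(z, z, kind) / tau
    not_self = ~np.eye(n, dtype=bool)
    positives = (y[:, None] == y[None, :]) & not_self
    # In-batch denominator: every other sample except the anchor itself.
    denom = (np.exp(logits) * not_self).sum(axis=1)
    if queue_z is not None:
        # Assumption: queued embeddings act only as extra negatives.
        q_logits = similarity(z, np.asarray(queue_z, dtype=float), kind) / tau
        q_negative = y[:, None] != np.asarray(queue_y)[None, :]
        denom = denom + (np.exp(q_logits) * q_negative).sum(axis=1)
    log_prob = logits - np.log(denom)[:, None]
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # anchors with no positive in the batch are skipped
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.mean()
```

Under these assumptions, a delayed queue could be approximated by passing `queue_z=None` for the first epochs and supplying stored embeddings afterwards; the abstract does not specify the schedule, so that detail is left open here.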
