Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
Runkun Chen, Yixiong Fang, Pengyu Chang, Yuante Li, Massa Baali, and Bhiksha Raj

TL;DR
This paper presents CoLMbo-DF, an audio language model that enhances deepfake speech detection by integrating structured acoustic evidence and reasoning, leading to improved accuracy and interpretability.
Contribution
The paper introduces a feature-guided audio language model with explicit acoustic reasoning, along with a new dataset annotated with chain-of-thought explanations for deepfake detection.
Findings
Outperforms existing baselines in deepfake detection accuracy.
Grounds model reasoning in interpretable acoustic evidence.
Delivers significant improvements while using a lightweight language model.
Abstract
Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs paired with…
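The idea of injecting structured textual representations of acoustic features into the model prompt can be illustrated with a minimal sketch. The feature names, units, and prompt wording below are assumptions made for illustration; this summary does not list the features or the prompt template that CoLMbo-DF actually uses.

```python
from dataclasses import dataclass


@dataclass
class AcousticFeatures:
    """Hypothetical low-level features; not the paper's actual feature set."""
    mean_f0_hz: float            # prosodic: average pitch
    f0_std_hz: float             # prosodic: pitch variability
    spectral_centroid_hz: float  # spectral: where the energy is concentrated
    jitter_percent: float        # voice-quality proxy: cycle-to-cycle pitch perturbation


def features_to_text(feats: AcousticFeatures) -> str:
    """Render numeric features as a short, structured text block."""
    lines = [
        f"- mean F0: {feats.mean_f0_hz:.1f} Hz",
        f"- F0 std: {feats.f0_std_hz:.1f} Hz",
        f"- spectral centroid: {feats.spectral_centroid_hz:.1f} Hz",
        f"- jitter: {feats.jitter_percent:.2f} %",
    ]
    return "\n".join(lines)


def build_prompt(feats: AcousticFeatures) -> str:
    """Place the textual evidence ahead of the question (assumed template)."""
    return (
        "Acoustic evidence:\n"
        f"{features_to_text(feats)}\n"
        "Question: Is this speech real or fake? "
        "Reason step by step from the evidence, then answer."
    )


if __name__ == "__main__":
    example = AcousticFeatures(
        mean_f0_hz=182.34,
        f0_std_hz=21.5,
        spectral_centroid_hz=2450.0,
        jitter_percent=0.8,
    )
    print(build_prompt(example))
```

In a setup like this, the language model would receive the rendered text alongside its audio embeddings, so any explanation it produces can refer to named, human-readable measurements rather than only to latent features.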
