Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought

Runkun Chen; Yixiong Fang; Pengyu Chang; Yuante Li; Massa Baali; and Bhiksha Raj

arXiv:2603.28021·cs.SD·April 1, 2026

TL;DR

This paper presents CoLMbo-DF, an audio language model that enhances deepfake speech detection by integrating structured acoustic evidence and reasoning, leading to improved accuracy and interpretability.

Contribution

It introduces a novel feature-guided audio language model with explicit acoustic reasoning and a new dataset with chain-of-thought annotations for deepfake detection.

Findings

01 Outperforms existing baselines in deepfake detection accuracy.

02 Grounds model reasoning in interpretable acoustic evidence.

03 Achieves significant improvements while using a lightweight language model.

Abstract

Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs paired with…
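The idea of injecting structured textual representations of acoustic features into the model prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, units, prompt wording, and the helpers `AcousticFeatures`, `features_to_text`, and `build_prompt` are all hypothetical, chosen only to show how prosodic, spectral, and physiological measurements might be serialized as text ahead of a reasoning request.

```python
from dataclasses import dataclass


@dataclass
class AcousticFeatures:
    """Hypothetical low-level features; names and units are illustrative only."""
    mean_f0_hz: float            # prosodic: mean fundamental frequency
    f0_std_hz: float             # prosodic: pitch variability
    spectral_centroid_hz: float  # spectral: centre of mass of the spectrum
    jitter_percent: float        # physiological proxy: cycle-to-cycle pitch perturbation


def features_to_text(feats: AcousticFeatures) -> str:
    """Serialize the features as one short, human-readable line per attribute."""
    lines = [
        f"- mean F0: {feats.mean_f0_hz:.1f} Hz",
        f"- F0 standard deviation: {feats.f0_std_hz:.1f} Hz",
        f"- spectral centroid: {feats.spectral_centroid_hz:.1f} Hz",
        f"- jitter: {feats.jitter_percent:.1f} %",
    ]
    return "\n".join(lines)


def build_prompt(
    feats: AcousticFeatures,
    question: str = "Is this speech real or fake? Explain step by step.",
) -> str:
    """Place the textual evidence before the question so the reasoning can cite it."""
    return (
        "Acoustic evidence:\n"
        f"{features_to_text(feats)}\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )


if __name__ == "__main__":
    # Made-up values, purely for demonstration.
    example = AcousticFeatures(182.34, 21.0, 2450.0, 1.2)
    print(build_prompt(example))
```

In a real system the feature values would come from a signal-processing front end, and the resulting text would accompany the audio embeddings given to the language model; how CoLMbo-DF combines the two is not specified in the excerpt above.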
