Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Xiaoxuan Guo; Yuankun Xie; Haonan Cheng; Jiayi Zhou; Jian Liu; Hengyan Huang; Long Ye; Qin Zhang

arXiv:2601.23066 · cs.SD · February 2, 2026

Open Access

TL;DR

This paper introduces SDD-APALLM, an acoustically enhanced audio LLM framework that explicitly exposes fine-grained acoustic cues, significantly improving speech deepfake detection accuracy and robustness by combining raw audio and spectrograms.
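
As a rough illustration of what "combining raw audio and spectrograms" can mean at the input level, the sketch below pairs a waveform with a log-magnitude spectrogram computed from it. This is not the SDD-APALLM pipeline: the function names `log_spectrogram` and `paired_views`, the Hann window, and the values `n_fft=512` and `hop=128` are all assumptions chosen for illustration, and the summary above does not state how the paper builds or fuses the two views.

```python
import numpy as np


def log_spectrogram(wave, n_fft=512, hop=128):
    """Log-magnitude STFT, shape (frames, n_fft // 2 + 1).

    Hypothetical helper: the window and sizes are illustrative assumptions,
    not settings taken from the paper.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if wave.size < n_fft:
        # Zero-pad very short clips so at least one frame exists.
        wave = np.pad(wave, (0, n_fft - wave.size))
    window = np.hanning(n_fft)
    n_frames = 1 + (wave.size - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent frames.
    return np.log(magnitude + 1e-8)


def paired_views(wave, n_fft=512, hop=128):
    """Return the raw waveform together with its spectrogram view."""
    return {
        "waveform": np.asarray(wave, dtype=np.float64),
        "spectrogram": log_spectrogram(wave, n_fft, hop),
    }


# One second of a 1 kHz tone at 16 kHz, standing in for a speech clip.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
views = paired_views(tone)
```

The spectrogram makes time-frequency structure explicit that is only implicit in the waveform, which is the general motivation for supplying both views to a model.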

Contribution

The paper presents a novel framework that enhances audio LLMs with explicit acoustic evidence exposure, addressing the bias towards semantic cues in deepfake detection.

Findings

01

Improved detection accuracy and robustness in speech deepfake detection.

02

Effective utilization of both semantic and acoustic cues.

03

Enhanced detection performance especially when semantic cues are misleading.

Abstract

Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly…

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing