Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

Yuankun Xie; Xiaoxuan Guo; Jiayi Zhou; Tao Wang; Jian Liu; Ruibo Fu; Xiaopeng Wang; Haonan Cheng; Long Ye

arXiv:2601.02983 · cs.SD · January 7, 2026

Open Access

TL;DR

This paper introduces a novel training paradigm for all-type audio deepfake detection using audio large language models, combining supervised fine-tuning and reinforcement learning with frequency-time structured rationales to improve performance and interpretability.

Contribution

The paper proposes an automatic annotation pipeline built around frequency-time structured rationales, together with a two-stage training method (SFT and FT-GRPO), for interpretable, all-type audio deepfake detection.

Findings

01. Achieves state-of-the-art performance on all-type ADD

02. Produces interpretable, frequency-time grounded rationales

03. Demonstrates effective generalization across heterogeneous audio types

Abstract

Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and…
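To make the sparse-supervision concern concrete, the following is a minimal sketch of the group-relative advantage used in standard GRPO, paired with a binary real/fake correctness reward. It is not the paper's FT-GRPO reward, whose details are not given in this excerpt. The function names `binary_reward` and `group_relative_advantages`, the `eps` constant, and the sample predictions are all illustrative assumptions.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, target: str) -> float:
    """Sparse reward (assumed for illustration): 1.0 if the label matches, else 0.0."""
    return 1.0 if predicted == target else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses sampled for the same input.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers for a single fake clip.
predictions = ["fake", "real", "fake", "fake"]
rewards = [binary_reward(p, "fake") for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because this reward looks only at the final label, two responses with the same label receive the same advantage whether or not their rationales are grounded in the audio. That is one way to read the abstract's warning about hallucinated rationales under vanilla RFT. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.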

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Speech Recognition and Synthesis