Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning

Artem Dvirniak; Evgeny Kushnir; Dmitrii Tarasov; Artem Iudin; Oleg Kiriukhin; Mikhail Pautov; Dmitrii Korzh; Oleg Y. Rogov

arXiv:2603.10725 · cs.SD · March 13, 2026

Open Access

TL;DR

This paper introduces HIR-SDD, a speech deepfake detection framework that leverages human-inspired reasoning and large audio language models to improve detection accuracy and interpretability across diverse audio domains.

Contribution

The paper presents a novel speech deepfake detection (SDD) approach that combines large audio language models (LALMs) with human-annotated reasoning chains to improve generalization and interpretability.

Findings

01 Effective detection across multiple audio domains

02 Provides human-like explanations for predictions

03 Improves robustness over existing methods

Abstract

Modern generative audio models can be used unlawfully by an adversary, specifically to impersonate other people and gain access to private information. To mitigate this threat, speech deepfake detection (SDD) methods have begun to emerge. Unfortunately, current SDD methods generally fail to generalize to new audio domains and generators. Moreover, they lack interpretability, in particular human-like reasoning that would naturally explain why a given audio sample is attributed to the bona fide or spoof class and would provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Speech and Audio Processing