Audio Deepfake Detection in the Age of Advanced Text-to-Speech models

Robin Singh; Aditya Yogesh Nair; Fabio Palumbo; Florian Barbaro; Anna Dyka; Lohith Rachakonda

arXiv:2601.20510·cs.SD·January 29, 2026



TL;DR

This paper evaluates four detection frameworks against synthetic speech generated by three advanced TTS models and argues for integrated, multi-view detection, because detector performance varies widely across TTS architectures.

Contribution

It provides a comprehensive comparison of detection frameworks across multiple TTS architectures and proposes a multi-view approach for robust deepfake audio detection.

Findings

01. Detection performance varies significantly across TTS models.

02. Single-paradigm detectors are often ineffective against certain architectures.

03. Multi-view detection strategies offer consistent robustness.

Abstract

Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These…
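As a rough illustration of what "combining complementary analysis levels" could look like, the sketch below fuses per-view detector scores by simple averaging. The abstract does not say how the views are combined, so the unweighted mean, the 0.5 threshold, the function name `fuse_scores`, and the example scores are all assumptions made for illustration only; the view names (semantic, structural, signal) are taken from the abstract.

```python
from statistics import mean


def fuse_scores(view_scores: dict[str, float], threshold: float = 0.5) -> tuple[float, bool]:
    """Late fusion sketch: average per-view 'fake' probabilities.

    ASSUMPTION: each view outputs a probability in [0, 1] that the clip
    is synthetic. The unweighted mean and the threshold are illustrative
    choices, not the method described in the paper.
    """
    if not view_scores:
        raise ValueError("at least one view score is required")
    fused = mean(list(view_scores.values()))
    return fused, fused >= threshold


# Hypothetical scores for one clip: the semantic view is fooled,
# while the structural and signal-level views still flag the clip.
sample = {"semantic": 0.25, "structural": 0.75, "signal": 0.875}
fused, is_fake = fuse_scores(sample)
print(f"fused={fused:.3f} flagged={is_fake}")
```

The point of the sketch is only that a view which fails on one TTS architecture can be outvoted by the others; a real system might instead learn fusion weights or train a classifier on the stacked scores.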


Code & Models

Datasets

UncovAI/UncovAI_TTS
dataset · 703 dl


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing