How Well Do Current Speech Deepfake Detection Methods Generalize to the Real World?
Daixian Li, Jun Xue, Yanzhen Ren, Zhuolin Yi, Yihuan Huang, Guanxiang Feng, Yi Chai

TL;DR
This paper evaluates how well current speech deepfake detection methods perform in real-world scenarios, revealing significant generalization challenges across diverse languages and acoustic environments.
Contribution
It introduces ML-ITW (Multilingual In-The-Wild), a benchmark covering 14 languages, seven major platforms, and 180 public figures (28.39 hours of audio), and systematically assesses how existing detection methods generalize to real-world conditions.
Findings
Detection performance drops significantly in real-world settings.
Existing methods struggle across multiple languages and acoustic conditions.
The new dataset enables more realistic evaluation of detection models than benchmarks that lack platform encoding, compression, and transmission effects.
Abstract
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real-world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML-ITW (Multilingual In-The-Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
