What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection
Shree Harsha Bokkahalli Satish, Harm Lameris, Joakim Gustafson, Éva Székely

TL;DR
This paper investigates how speech restoration and voice conversion challenge deepfake detection systems by causing distributional shifts that lead to misclassification, proposing a multi-class approach for improved robustness.
Contribution
It introduces a multi-class formulation for audio anti-spoofing that better handles benign transformations and maintains spoof detection capabilities.
Findings
Benign transformations cause distributional drift in SSL embeddings.
Multi-class modeling improves robustness to benign shifts.
Binary systems tend to model the distribution of raw, unprocessed speech rather than authenticity.
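One way to make the drift finding concrete is to compare how far apart the bona fide and spoofed class centroids sit in embedding space before and after a benign transformation. The sketch below is only an illustration under stated assumptions: embeddings are plain lists of floats, and the function names (`centroid`, `centroid_distance`, `compression_ratio`) are hypothetical helpers, not taken from the paper, which may measure drift differently.

```python
import math

def centroid(vectors):
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(a), centroid(b))

def compression_ratio(bona, spoof, bona_conv, spoof_conv):
    """Bona fide/spoof centroid gap after the transformation divided by
    the gap before it. Values below 1 mean the two classes moved closer
    together. Assumes the untransformed centroids are not identical."""
    before = centroid_distance(bona, spoof)
    after = centroid_distance(bona_conv, spoof_conv)
    return after / before
```

A ratio well below 1 on held-out embeddings would be consistent with the compression described in the findings; centroid distance is a coarse proxy and ignores within-class spread.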
Abstract
Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup separating bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof detection, suggesting…
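The four-class setup can still yield a single spoof score for downstream use. The following is a minimal sketch, not the authors' implementation: the class ordering, the grouping of `spoofed` and `converted_spoofed` as spoof-derived, and the helper names are assumptions made for illustration, and the logits stand in for the output of any classifier head over SSL embeddings.

```python
import math

# The four classes named in the abstract; this ordering is an assumption.
CLASSES = ("bona_fide", "converted", "spoofed", "converted_spoofed")
# Assumed grouping: classes whose underlying speech is synthetic count as spoof.
SPOOF_CLASSES = {"spoofed", "converted_spoofed"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the name of the most probable of the four classes."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]

def spoof_probability(logits):
    """Sum the posterior mass assigned to the spoof-derived classes."""
    probs = softmax(logits)
    return sum(p for c, p in zip(CLASSES, probs) if c in SPOOF_CLASSES)
```

Under this grouping, an utterance classified as `converted` contributes little to the spoof score, so a benign transformation does not by itself trigger a spoof decision.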
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research
