Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

TL;DR
This paper introduces a training-free deepfake voice detection method built on large-scale pre-trained models. It reformulates detection as speaker verification, so no fake speech samples are needed for training and generalization across diverse datasets improves.
Contribution
The study demonstrates that using large pre-trained models in a speaker verification framework enables effective deepfake detection without training on fake data, enhancing out-of-distribution generalization.
Findings
Achieves high accuracy on multiple datasets.
Outperforms supervised methods on out-of-distribution data.
Requires only limited voice fragments at detection time.
Abstract
Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which increasingly accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large…
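The verification idea described in the abstract can be illustrated with a minimal sketch. It assumes that each audio clip has already been mapped to a fixed-length embedding by some pre-trained model (not shown here), and it compares the clip under test with reference clips of the claimed speaker. The function names, the mean-cosine scoring rule, and the 0.5 threshold are hypothetical choices made for illustration; they are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verification_score(test_embedding, reference_embeddings):
    """Mean similarity between the test clip and the claimed speaker's references.

    Averaging is an assumed scoring rule; other aggregations are possible.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    scores = [cosine_similarity(test_embedding, ref) for ref in reference_embeddings]
    return sum(scores) / len(scores)


def is_fake(test_embedding, reference_embeddings, threshold=0.5):
    """Flag the clip as fake when it does not match the claimed identity.

    The threshold is a hypothetical value; in practice it would be tuned
    on real speech only, since no fake samples are used.
    """
    return verification_score(test_embedding, reference_embeddings) < threshold


# Toy 3-dimensional embeddings standing in for real model features.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
matching_clip = [0.95, 0.05, 0.0]
mismatched_clip = [0.0, 1.0, 0.0]

print(is_fake(matching_clip, references))
print(is_fake(mismatched_clip, references))
```

Because the decision depends only on real reference speech of the claimed identity, nothing in this sketch needs examples produced by a particular synthesis method.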
