Alethia: A Foundational Encoder for Voice Deepfakes
Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti

TL;DR
Alethia is a foundational audio encoder for voice deepfake detection and localization. Its pretraining recipe combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction, and it outperforms existing speech foundation models (SFMs) in robustness and generalization.
Contribution
Proposes Alethia, presented as the first foundational encoder for voice deepfake detection and localization tasks, using a new pretraining recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction.
Findings
Alethia outperforms state-of-the-art SFMs across 5 tasks and 56 datasets.
Alethia shows superior robustness to real-world perturbations.
Alethia demonstrates strong zero-shot generalization to unseen domains (e.g., singing deepfakes).
Abstract
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on different tasks with benchmark datasets, and note Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous…
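The abstract names the two pretraining objectives but does not give their formulas, so the following is only an illustrative sketch of how such a combined objective might look. It assumes a mean-squared error on continuous embeddings at masked frames and a standard straight-path flow-matching target (noise to data); the function names, the `weight` parameter, and the way the two terms are summed are all hypothetical and are not taken from the paper.

```python
import random


def masked_embedding_loss(pred, target, mask):
    """MSE between predicted and target embeddings, counted on masked frames only.

    pred, target: lists of frames, each frame a list of floats (continuous targets).
    mask: list of booleans, True where the frame was masked at the encoder input.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


def flow_matching_pair(x1, rng):
    """Sample a training pair for flow matching on a straight noise-to-data path.

    x1 is a flattened spectrogram patch (list of floats). Returns the interpolated
    point x_t, the time t in [0, 1), and the target velocity x1 - x0.
    """
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, t, v_target


def flow_matching_loss(v_pred, v_target):
    """MSE between a predicted velocity field and the straight-path target."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


def combined_loss(embed_loss, recon_loss, weight=1.0):
    """Hypothetical weighted sum of the two objectives (weighting is an assumption)."""
    return embed_loss + weight * recon_loss
```

In this sketch, a decoder conditioned on the encoder output would supply `v_pred` for each sampled `(x_t, t)`, and the encoder's predictions at masked frames would supply `pred`. Regressing continuous embeddings rather than classifying discrete tokens mirrors the abstract's remark about the limitation of discrete targets, though the actual target construction in the paper may differ.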
