Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces
Farhan Sheth, Girish, Mohd Mujtaba Akhtar, Muskaan Singh

TL;DR
This paper introduces RHYME, a geometry-aware framework that uses hyperbolic and spherical spaces to improve generalization in speech deepfake detection across diverse synthesis methods.
Contribution
It proposes a novel non-Euclidean embedding approach that captures shared structural distortions in synthetic speech, enabling better cross-paradigm detection.
Findings
RHYME outperforms existing methods in cross-paradigm detection accuracy.
Hyperbolic and spherical geometries effectively model hierarchical and angular features.
RHYME achieves state-of-the-art results in generalizable speech deepfake detection.
Abstract
In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds, where hyperbolic geometry excels at modeling hierarchical…
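The abstract as shown here does not give RHYME's projection formulas, so the following is only a minimal sketch of what mapping an utterance-level embedding into hyperbolic and spherical spaces can look like. It assumes the standard exponential map at the origin of the Poincaré ball (curvature parameter `c`) for the hyperbolic branch and plain L2 normalization for the spherical branch. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def expmap0(v, c=1.0, eps=1e-9):
    """Map a Euclidean vector into the Poincare ball of curvature -c
    using the exponential map at the origin (an assumed choice)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return [0.0 for _ in v]
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def to_sphere(v, eps=1e-9):
    """Project a vector onto the unit hypersphere by L2 normalization."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * x_sq) * (1.0 - c * y_sq))
    return math.acosh(arg) / math.sqrt(c)


def cosine_similarity(u, v):
    """Angular similarity; for unit vectors this is just the dot product."""
    u_hat, v_hat = to_sphere(u), to_sphere(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))


# Hypothetical 2-D "embeddings" standing in for encoder outputs.
real_emb = [0.3, 0.4]
fake_emb = [0.4, -0.3]

hyp_real, hyp_fake = expmap0(real_emb), expmap0(fake_emb)
hyperbolic_gap = poincare_distance(hyp_real, hyp_fake)
angular_sim = cosine_similarity(real_emb, fake_emb)
```

In this sketch the hyperbolic distance grows quickly as points approach the ball's boundary, which is why hyperbolic space suits tree-like, hierarchical structure, while the spherical branch discards magnitude and keeps only direction. How RHYME fuses the two views is not described in the visible part of the abstract.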
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
