VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings
Bharath Krishnamurthy, Ajita Rattani

TL;DR
VoxMorph is a scalable, zero-shot voice morphing framework that disentangles vocal traits into prosody and timbre embeddings for high-fidelity, fine-grained identity manipulation, exposing a largely unexplored vulnerability in voice biometric systems.
Contribution
The paper presents a zero-shot voice morphing method that requires no model retraining and gives fine-grained control over vocal traits through disentangled prosody and timbre embeddings.
Findings
Achieves 2.6x improvement in audio quality
Reduces intelligibility errors by 73%
Attains 67.8% attack success rate on speaker verification
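As a rough illustration of what an attack success rate on speaker verification measures, the sketch below counts a morph as successful only when it is accepted against every contributing speaker. This summary does not define the paper's exact metric, so the acceptance rule, the cosine-similarity scoring, the threshold, and the names `cosine_similarity` and `attack_success_rate` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def attack_success_rate(
    trials: List[Tuple[Sequence[float], List[Sequence[float]]]],
    threshold: float,
) -> float:
    """Fraction of morphs accepted against ALL of their contributors.

    Each trial is (morph_embedding, [contributor_embedding, ...]).
    Assumption: a verifier accepts when cosine similarity >= threshold.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    successes = sum(
        1
        for morph, contributors in trials
        if all(cosine_similarity(morph, c) >= threshold for c in contributors)
    )
    return successes / len(trials)
```

Under this convention, a morph that matches only one of its contributors does not count as a success, which is what makes the attack a morphing attack rather than ordinary impersonation.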
Abstract
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style…
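The idea of interpolating prosody and timbre separately can be sketched as follows. This is a minimal illustration, not the authors' implementation: the truncated abstract does not state the interpolation scheme, so plain linear interpolation is assumed here, and the function names `lerp` and `morph_embeddings` and the weights `alpha_prosody` and `alpha_timbre` are hypothetical.

```python
from typing import List, Sequence, Tuple


def lerp(a: Sequence[float], b: Sequence[float], alpha: float) -> List[float]:
    """Linear interpolation: (1 - alpha) * a + alpha * b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def morph_embeddings(
    prosody_a: Sequence[float],
    timbre_a: Sequence[float],
    prosody_b: Sequence[float],
    timbre_b: Sequence[float],
    alpha_prosody: float = 0.5,
    alpha_timbre: float = 0.5,
) -> Tuple[List[float], List[float]]:
    """Blend two speakers' embeddings with independent weights.

    Assumption: prosody and timbre live in separate embedding spaces,
    so each pair is interpolated on its own.
    """
    morphed_prosody = lerp(prosody_a, prosody_b, alpha_prosody)
    morphed_timbre = lerp(timbre_a, timbre_b, alpha_timbre)
    return morphed_prosody, morphed_timbre
```

Keeping the two weights independent is what would allow, for example, a morph that keeps speaker A's timbre (`alpha_timbre=0.0`) while moving halfway toward speaker B's speaking style (`alpha_prosody=0.5`).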
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
