VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings

Bharath Krishnamurthy; Ajita Rattani

arXiv:2601.20883·cs.SD·January 30, 2026

TL;DR

VoxMorph is a zero-shot voice morphing framework that disentangles vocal traits into prosody and timbre embeddings, producing high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining.

Contribution

The paper presents a zero-shot voice morphing method that requires no model retraining and gives fine-grained control over vocal traits through disentangled embeddings.

Findings

01 Achieves 2.6x improvement in audio quality

02 Reduces intelligibility errors by 73%

03 Attains 67.8% attack success rate on speaker verification

Abstract

Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style…
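
The abstract describes separate prosody and timbre embeddings that can be interpolated between two speakers. A minimal sketch of that idea follows, assuming plain linear interpolation with one weight per embedding type. The excerpt does not state the interpolation scheme VoxMorph actually uses, and the names `VoiceEmbeddings`, `lerp`, `morph_embeddings`, `alpha_prosody`, and `alpha_timbre` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceEmbeddings:
    """Hypothetical container for one speaker's disentangled embeddings."""
    prosody: List[float]  # speaking-style vector (assumed representation)
    timbre: List[float]   # voice-colour vector (assumed representation)


def lerp(a: List[float], b: List[float], alpha: float) -> List[float]:
    """Linear interpolation: alpha=0 returns a, alpha=1 returns b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def morph_embeddings(
    speaker_a: VoiceEmbeddings,
    speaker_b: VoiceEmbeddings,
    alpha_prosody: float = 0.5,
    alpha_timbre: float = 0.5,
) -> VoiceEmbeddings:
    """Blend prosody and timbre independently (assumption: linear blending)."""
    return VoiceEmbeddings(
        prosody=lerp(speaker_a.prosody, speaker_b.prosody, alpha_prosody),
        timbre=lerp(speaker_a.timbre, speaker_b.timbre, alpha_timbre),
    )


# Toy two-dimensional vectors; real embeddings would come from an encoder.
speaker_a = VoiceEmbeddings(prosody=[0.0, 2.0], timbre=[1.0, 1.0])
speaker_b = VoiceEmbeddings(prosody=[4.0, 0.0], timbre=[3.0, 5.0])

morph = morph_embeddings(speaker_a, speaker_b, alpha_prosody=0.25, alpha_timbre=0.5)
print(morph.prosody)  # [1.0, 1.5]
print(morph.timbre)   # [2.0, 3.0]
```

Keeping one weight per embedding type is what makes the control fine-grained in this sketch: the morph can lean toward one speaker's speaking style while sitting midway between the two voices in timbre.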

Code & Models

Models

🤗 BharathK333/VoxMorph-Models
model · 2 dl

Datasets

BharathK333/VoxMorph-Dataset
dataset · 8.5k dl

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis