RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

Xinting Liao; Ruinan Jin; Hanlin Yu; Deval Pandya; Xiaoxiao Li

arXiv:2602.00443·cs.SD·February 3, 2026

TL;DR

RVCBench is a comprehensive benchmark that evaluates the robustness of modern voice cloning models across various realistic challenges, revealing significant vulnerabilities and guiding future improvements.

Contribution

This paper introduces RVCBench, the first extensive benchmark for assessing the robustness of voice cloning models under real-world conditions, covering multiple robustness tasks and models.

Findings

01 Current VC models show substantial robustness gaps.

02 Performance drops under input shifts and output post-processing.

03 Long-context and cross-lingual scenarios reveal stability issues.

Abstract

Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audio, imperfect text prompts, and diverse downstream processing, all of which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative…
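
As a rough illustration of the "noisy reference audio" kind of input variation mentioned in the abstract, the sketch below mixes white Gaussian noise into a clip at a chosen signal-to-noise ratio. It is not taken from RVCBench: the function names, the synthetic 220 Hz tone standing in for a reference recording, and the 10 dB setting are all assumptions made for the example, and the benchmark's actual perturbations may differ.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR (dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a reference clip: one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
reference = [
    0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)
]

# Perturb the reference at an assumed 10 dB SNR and check the realized level.
noisy_reference = add_noise_at_snr(reference, snr_db=10.0)
print(round(measured_snr_db(reference, noisy_reference), 1))
```

In a robustness evaluation of this kind, one would typically feed both the clean and the perturbed reference to the same cloning model and compare the outputs; that comparison step is omitted here because it depends on the model and metrics in use.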

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Topic Modeling