V2C: Visual Voice Cloning
Qi Chen, Yuanqing Li, Yuankai Qi, Jiaqiu Zhou, Mingkui Tan, Qi Wu

TL;DR
This paper introduces Visual Voice Cloning (V2C), a new task that synthesizes speech from text with the voice specified by a reference audio and the emotion specified by a reference video. It is supported by a new dataset and a new evaluation metric, and targets scenarios such as movie dubbing that traditional voice cloning does not cover.
Contribution
The paper proposes the V2C task, constructs the V2C-Animation dataset, builds a baseline on existing state-of-the-art VC techniques, and develops a new evaluation metric for emotionally and visually conditioned speech synthesis.
Findings
Existing state-of-the-art (SoTA) voice cloning (VC) methods perform poorly on the V2C task.
The V2C-Animation dataset contains 10,217 animated movie clips covering diverse genres and emotions.
The proposed evaluation metric effectively measures speech similarity in V2C applications.
Abstract
Existing Voice Cloning (VC) tasks aim to convert a paragraph of text into speech with a desired voice specified by a reference audio. This has significantly boosted the development of artificial speech applications. However, many scenarios are not well reflected by these VC tasks, such as movie dubbing, which requires the speech to carry emotions consistent with the movie plot. To fill this gap, in this work we propose a new task named Visual Voice Cloning (V2C), which seeks to convert a paragraph of text into speech with both a desired voice specified by a reference audio and a desired emotion specified by a reference video. To facilitate research in this field, we construct a dataset, V2C-Animation, and propose a strong baseline based on existing state-of-the-art (SoTA) VC techniques. Our dataset contains 10,217 animated movie clips covering a large variety of genres…
