Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

Fang Kang; Yin Cao; Haoyu Chen

arXiv:2507.19225·cs.SD·July 28, 2025

Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

Fang Kang, Yin Cao, Haoyu Chen

PDF

Open Access

TL;DR

Face2VoiceSync is a lightweight framework that generates synchronized talking face animations and speech from a face image and text, addressing face-voice mismatch issues with novel alignment and control features.

Contribution

It introduces a new framework with face-voice alignment, voice manipulation, efficient training, and a novel evaluation metric for text-driven talking face generation.

Findings

01

Achieves state-of-the-art visual and audio results.

02

Uses significantly fewer trainable parameters.

03

Successfully controls paralinguistic features.

Abstract

Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed-driven speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both talking face animation and its corresponding speeches. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity \& Manipulation, enabling generated voice control over paralinguistic features space; 3) Efficient Training, using a lightweight VAE to bridge visual and audio large-pretrained models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing the diversity and identity consistency. Experiments show Face2VoiceSync…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFace recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing