Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
Fang Kang, Yin Cao, Haoyu Chen

TL;DR
Face2VoiceSync is a lightweight framework that generates synchronized talking face animations and speech from a face image and text, addressing face-voice mismatch issues with novel alignment and control features.
Contribution
It introduces a new framework with face-voice alignment, voice manipulation, efficient training, and a novel evaluation metric for text-driven talking face generation.
Findings
Achieves state-of-the-art visual and audio results.
Uses significantly fewer trainable parameters than existing methods.
Successfully controls paralinguistic features.
Abstract
Recent studies in speech-driven talking face generation achieve promising results, but their reliance on a fixed driving speech limits further applications (e.g., it can cause face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generate both the talking face animation and the corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice in the paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing diversity and identity consistency. Experiments show Face2VoiceSync…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
