TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles
Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, Xin Yu

TL;DR
TalkCLIP is a novel framework that generates realistic talking head videos with expressions guided solely by natural language descriptions, enabling flexible and convenient expression control without extra reference videos.
Contribution
It introduces a CLIP-based style encoder and a new text-video paired dataset to enable text-guided expression generation in talking head videos, including unseen expressions.
Findings
Achieves photo-realistic talking head generation with vivid expressions
Can infer and edit expressions based on unseen text descriptions
Demonstrates advanced control over expression intensity and style
Abstract
Audio-driven talking head generation has drawn growing attention. To produce talking head videos with desired facial expressions, previous methods rely on extra reference videos to provide expression information, which may be difficult to find and hence limits their usage. In this work, we propose TalkCLIP, a framework that can generate talking heads where the expressions are specified by natural language, hence allowing for specifying expressions more conveniently. To model the mapping from text to expressions, we first construct a text-video paired talking head dataset where each video has diverse text descriptions that depict both coarse-grained emotions and fine-grained facial movements. Leveraging the proposed dataset, we introduce a CLIP-based style encoder that projects natural language-based descriptions to the representations of expressions. TalkCLIP can even infer expressions…
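The abstract describes a CLIP-based style encoder that projects a natural language description to an expression representation. The sketch below illustrates only the general shape of such a mapping; it is not the authors' implementation. Everything in it is an assumption made for illustration: `toy_text_embedding` is a hashed bag-of-words stand-in for a frozen CLIP text encoder, `StyleAdapter` is a hypothetical two-layer MLP with random, untrained weights, and the dimensions `EMBED_DIM` and `STYLE_DIM` are arbitrary.

```python
import hashlib
import math
import random

EMBED_DIM = 16  # assumed size of the text embedding (real CLIP embeddings are larger)
STYLE_DIM = 8   # assumed size of the expression/style code


def toy_text_embedding(text, dim=EMBED_DIM):
    """Placeholder for a frozen CLIP text encoder (NOT CLIP itself).

    Builds a hashed bag-of-words vector and L2-normalises it.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


class StyleAdapter:
    """Hypothetical two-layer MLP: text embedding -> style code.

    Weights are random and untrained; in a real system they would be
    learned from text-video pairs.
    """

    def __init__(self, in_dim=EMBED_DIM, hidden=32, out_dim=STYLE_DIM, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(out_dim)]

    def __call__(self, emb):
        # Linear layer followed by ReLU, then a second linear layer.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, emb))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


def encode_style(text, adapter):
    """Map a text description to a style code via the two stages above."""
    return adapter(toy_text_embedding(text))


adapter = StyleAdapter()
style_code = encode_style("smiling broadly with raised eyebrows", adapter)
print(len(style_code))  # one value per style dimension
```

In a setup like this, the style code would be passed to the talking head generator alongside the audio features, taking the place of the style extracted from a reference video.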
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
Methods: Contrastive Language-Image Pre-training
