TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles

Yifeng Ma; Suzhen Wang; Yu Ding; Bowen Ma; Tangjie Lv; Changjie Fan; Zhipeng Hu; Zhidong Deng; Xin Yu

arXiv:2304.00334 · cs.CV · August 13, 2024 · 6 cites

Open Access

TL;DR

TalkCLIP is a novel framework that generates realistic talking head videos with expressions guided solely by natural language descriptions, enabling flexible and convenient expression control without extra reference videos.

Contribution

It introduces a CLIP-based style encoder and a new text-video paired dataset to enable text-guided expression generation in talking head videos, including unseen expressions.
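
As a rough illustration of what "projecting a text description to an expression representation" could look like, the sketch below maps a text embedding through a small adapter network to a style code. It is a minimal sketch under stated assumptions, not the paper's implementation: `stand_in_text_embedding` is a hypothetical placeholder for a frozen CLIP text encoder (it only produces a deterministic unit vector per string), and `StyleAdapter`, `TEXT_DIM`, and `STYLE_DIM` are names and sizes invented for this example.

```python
import hashlib
import math
import random

TEXT_DIM = 16   # assumed text-embedding width (hypothetical, kept small)
STYLE_DIM = 8   # assumed style-code width (hypothetical)


def stand_in_text_embedding(text):
    """Placeholder for a frozen text encoder: one deterministic unit vector per string.

    This is NOT CLIP; it only mimics the interface (string in, fixed-size vector out).
    """
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def make_linear(rng, n_in, n_out):
    """Create a randomly initialised linear layer as (weights, bias)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a linear layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class StyleAdapter:
    """Hypothetical two-layer adapter mapping a text embedding to a style code."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.hidden = make_linear(rng, TEXT_DIM, TEXT_DIM)
        self.out = make_linear(rng, TEXT_DIM, STYLE_DIM)

    def __call__(self, text_emb):
        h = [max(0.0, v) for v in linear(text_emb, self.hidden)]  # ReLU
        return linear(h, self.out)


adapter = StyleAdapter()
text_emb = stand_in_text_embedding("happy, with raised eyebrows")
style_code = adapter(text_emb)  # would condition a downstream video generator
```

In a real system the adapter weights would be trained on paired text and video data so that the style code matches the expression seen in the video; here they are random and only show the data flow.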

Findings

01. Achieves photo-realistic talking head generation with vivid expressions

02. Can infer and edit expressions based on unseen text descriptions

03. Demonstrates advanced control over expression intensity and style

Abstract

Audio-driven talking head generation has drawn growing attention. To produce talking head videos with desired facial expressions, previous methods rely on extra reference videos to provide expression information, which may be difficult to find and hence limits their usage. In this work, we propose TalkCLIP, a framework that can generate talking heads where the expressions are specified by natural language, hence allowing for specifying expressions more conveniently. To model the mapping from text to expressions, we first construct a text-video paired talking head dataset where each video has diverse text descriptions that depict both coarse-grained emotions and fine-grained facial movements. Leveraging the proposed dataset, we introduce a CLIP-based style encoder that projects natural language-based descriptions to the representations of expressions. TalkCLIP can even infer expressions…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation

Methods: Contrastive Language-Image Pre-training