ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment
Yicheng Zhong, Huawei Wei, Peiji Yang, Zhisheng Wang

TL;DR
ExpCLIP introduces a flexible, language-driven approach for controlling facial expressions in speech-driven animation by aligning text prompts with facial styles using a novel dataset and CLIP-based model.
Contribution
The paper presents TEAD, a Text-Expression Alignment Dataset built with an LLM-supported automatic annotation method, and ExpCLIP, a CLIP-based model that semantically aligns text with facial expressions, enabling style control in animations.
Findings
Supports arbitrary style control via natural language prompts.
Achieves semantically aligned text and facial expression embeddings.
Enhances diversity and expressiveness in facial animation styles.
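To make "semantically aligned text and facial expression embeddings" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective over paired embeddings. It is an assumption about what "CLIP-based" alignment could look like, not the authors' confirmed training loss. The function names, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each vector to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, expr_emb, temperature=0.07):
    # Hypothetical setup: row i of text_emb describes row i of expr_emb.
    # The temperature is an illustrative value, not taken from the paper.
    t = l2_normalize(text_emb)
    e = l2_normalize(expr_emb)
    logits = t @ e.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_t2e = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> expression
    loss_e2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # expression -> text
    return 0.5 * (loss_t2e + loss_e2t)

def retrieve_expression(query_emb, expr_bank):
    # Index of the expression embedding closest (by cosine) to a text query embedding.
    sims = l2_normalize(expr_bank) @ l2_normalize(query_emb)
    return int(np.argmax(sims))
```

Under this sketch, correctly paired batches score a lower loss than mismatched ones, and once the two spaces are aligned a text prompt embedding can select a facial style by nearest-neighbor lookup.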
Abstract
The objective of stylized speech-driven facial animation is to create animations that encapsulate specific emotional expressions. Existing methods often depend on pre-established emotional labels or facial expression templates, which may limit the necessary flexibility for accurately conveying user intent. In this research, we introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts. This technique presents benefits in terms of both flexibility and user-friendliness. To realize this objective, we initially construct a Text-Expression Alignment Dataset (TEAD), wherein each facial expression is paired with several prompt-like descriptions. We propose an innovative automatic annotation method, supported by Large Language Models (LLMs), to expedite the dataset construction, thereby eliminating the substantial expense of manual…
Taxonomy
Topics: Human Motion and Animation · Face recognition and analysis
