CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes
Kim Youwang, Kim Ji-Yeon, Tae-Hyun Oh

TL;DR
CLIP-Actor is a novel system that generates human mesh animations from text prompts by recommending motions and applying neural style optimization, ensuring temporally consistent and realistic results.
Contribution
It introduces a zero-shot neural style optimization method and a text-driven motion recommendation system for 3D human meshes, addressing limitations of prior pose-dependent approaches.
Findings
Produces plausible, human-recognizable 3D animations from text
Ensures temporal consistency and pose-agnostic stylization
Leverages multi-frame data for stable optimization
Abstract
We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and optimizing mesh style attributes. We build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor suggests a text-conforming human motion in a coarse-to-fine manner. Then, our novel zero-shot neural style optimization detailizes and texturizes the recommended mesh sequence to conform to the prompt in a temporally-consistent and pose-agnostic manner. This is distinctive in that prior work fails to generate plausible results when the pose of an artist-designed mesh does not conform to the text from the beginning. We further propose the spatio-temporal view…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
MethodsContrastive Language-Image Pre-training
