PTTA: A Pure Text-to-Animation Framework for High-Quality Creation
Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang

TL;DR
PTTA is a framework that converts textual descriptions into high-quality animations by fine-tuning a pretrained text-to-video model (HunyuanVideo) on a small, specialized animation dataset, addressing the limitations that general video generation models exhibit on animation content.
Contribution
It introduces a dedicated animation dataset and adapts a pretrained text-to-video model for superior animation synthesis from text descriptions.
Findings
Outperforms baseline models in animation quality
Successfully adapts text-to-video models for animation creation
Demonstrates high-quality animation generation from text
Abstract
Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · 3D Shape Modeling and Analysis
