PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

Ruiqi Chen; Kaitong Cai; Yijia Fan; Keze Wang

arXiv:2512.18614 · cs.CV · December 23, 2025

TL;DR

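PTTA converts textual descriptions into high-quality animations by fine-tuning the pretrained text-to-video model HunyuanVideo on a small, high-quality dataset of animation videos paired with textual descriptions. It targets the limitations that video generation models trained for natural video show when applied to animation.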

Contribution

PTTA introduces a small-scale, high-quality paired dataset of animation videos and textual descriptions, and fine-tunes the pretrained text-to-video model HunyuanVideo on it for animation-style generation from text.

Findings

01. Outperforms baseline models in animation quality

02. Successfully adapts text-to-video models for animation creation

03. Demonstrates high-quality animation generation from text

Abstract

Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · 3D Shape Modeling and Analysis