ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Lingen Li; Guangzhi Wang; Zhaoyang Zhang; Yaowei Li; Xiaoyu Li; Qi Dou; Jinwei Gu; Tianfan Xue; Ying Shan

arXiv:2508.10881·cs.CV·August 15, 2025

TL;DR

ToonComposer is a unified generative model that streamlines cartoon production by combining inbetweening and colorization into a single stage, reducing manual effort and enhancing control with sparse inputs.

Contribution

It introduces a novel unified framework with sparse sketch injection and domain adaptation techniques for efficient, flexible cartoon post-production.

Findings

1. Outperforms existing methods in visual quality and motion consistency.
2. Requires minimal input, such as a single sketch and reference frame.
3. Supports multiple sketches for precise motion control.

Abstract

Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference…
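
The abstract names a spatial low-rank adapter but does not describe its internals here. The following is only a minimal sketch of the general idea, under the assumption that the adapter adds a low-rank residual whose token mixing happens within each frame, so no information flows across time through the adapter. The function name, tensor shapes, and the zero-initialized up-projection are illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_low_rank_adapter(x, down, up):
    """Hypothetical spatial low-rank residual.

    x:    (T, N, D) tokens for T frames with N spatial tokens each
    down: (D, r) down-projection, r << D
    up:   (r, D) up-projection
    """
    h = x @ down                                               # (T, N, r)
    # Attention is computed per frame: scores have shape (T, N, N),
    # so tokens only attend to tokens of the same frame.
    scores = h @ h.transpose(0, 2, 1) / np.sqrt(h.shape[-1])
    mixed = softmax(scores, axis=-1) @ h                       # (T, N, r)
    return x + mixed @ up                                      # residual update

rng = np.random.default_rng(0)
T, N, D, r = 4, 6, 16, 2
x = rng.normal(size=(T, N, D))
down = rng.normal(size=(D, r)) * 0.1
up = rng.normal(size=(r, D)) * 0.1

# With a zero up-projection the adapter is an identity map (assumed init).
out_init = spatial_low_rank_adapter(x, down, np.zeros((r, D)))

# Perturbing frame 1 must not change the adapter output for frame 0.
out = spatial_low_rank_adapter(x, down, up)
x_perturbed = x.copy()
x_perturbed[1] += 1.0
out_perturbed = spatial_low_rank_adapter(x_perturbed, down, up)
```

Because the adapter never mixes tokens across frames in this sketch, whatever temporal behavior the base model has is left to its own temporal layers, which is the property the abstract describes as keeping the temporal prior intact.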

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The sparse sketch injection mechanism enables precise temporal motion control with minimal sketch input. The inclusion of region-wise control further increases usability and flexibility, allowing users to input incomplete sketches for targeted generation.
2. The introduction of the spatial low-rank adapter (SLRA) is a significant technical contribution. Ablation experiments show that SLRA consistently outperforms standard LoRA and other adaptation baselines by effectively tailoring only the s

Weaknesses

1. This is not the first DiT-based anime in-betweening and colorization paper. SketchColour[1] and AnimeColor[2] are also built on DiT-based models. Though they may be concurrent works, the authors should still not claim to be the first. Moreover, I personally don't recognize transferring a similar technique from UNet to DiT as a contribution.
2. The idea of optimizing only spatial attention is not novel either. ToonCrafter[3] already found that finetuning spatial layers only wor

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The core contribution is the unification of sketch colorization and interpolation into a single, cohesive task. This approach has the potential to significantly streamline the anime production workflow.
- The curation of the large-scale PKData dataset and the development of the PKBench benchmark, which uniquely includes human-drawn sketches, are valuable resources. They facilitate robust training and a more rigorous evaluation of cartoon generation models than previously possible.
- The pape

Weaknesses

- The paper uses sparse sketches and color references as simultaneous control conditions. However, the concept of using multiple controls is not entirely novel and has been explored in prior works (e.g., LayerAnimate[1]). The authors should more explicitly discuss the unique advantages of their specific approach in the animation workflow.
- Temporal Alignment: The VAE in Wan employs a 4x temporal compression, meaning each latent token encapsulates information from multiple frames. The paper sta
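
To make the temporal-alignment concern concrete, here is a small arithmetic sketch of how frame indices could map to latent indices under 4x temporal compression. It assumes the causal video-VAE convention in which the first frame gets its own latent and every following group of four frames shares one; the review only states the 4x factor, so the first-frame handling and the function names are assumptions.

```python
def latent_index(frame_idx, temporal_stride=4):
    """Map a video frame index to its latent frame index.

    Assumed convention: frame 0 is encoded alone, then each block of
    `temporal_stride` consecutive frames shares one latent frame.
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    if frame_idx == 0:
        return 0
    return (frame_idx - 1) // temporal_stride + 1

def num_latent_frames(num_frames, temporal_stride=4):
    """Number of latent frames for a clip of `num_frames` frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) // temporal_stride + 1

# Frames 1 through 4 collapse onto the same latent frame, so a sketch
# drawn for frame 2 cannot be distinguished from one for frame 3 at
# latent resolution without extra handling.
shared = [f for f in range(9) if latent_index(f) == 1]
```

Under this assumption, a keyframe sketch placed at an arbitrary frame lands on a latent that also represents up to three neighboring frames, which is the ambiguity the reviewer is pointing at.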

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* Introduces a single-stage post-keyframing formulation that unifies inbetweening and colorization, reducing the error accumulation inherent to two-stage pipelines and enabling control from as little as one sketch + one colored frame.
* Proposes sparse sketch injection with position-aware residual and positional-encoding mapping (sketch tokens appended to the latent token sequence with remapped RoPE to target specific frame indices); this is a clean, DiT-native control mechanism distinct from ch
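
The positional-encoding mapping mentioned above can be illustrated with a short sketch. It assumes a 1D temporal RoPE and a fixed number of tokens per frame, and it only shows the indexing idea: appended sketch tokens reuse the temporal position of the frame they are meant to control. The helper names and shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_temporal_positions(num_latent_frames, tokens_per_frame, sketch_frames):
    """Temporal position ids for video tokens followed by sketch tokens.

    Video tokens get positions 0..num_latent_frames-1 (repeated per token).
    Sketch tokens are appended at the end of the sequence, but their
    position id is remapped to the frame index they target.
    """
    video_pos = np.repeat(np.arange(num_latent_frames), tokens_per_frame)
    sketch_pos = np.repeat(np.asarray(sketch_frames, dtype=int), tokens_per_frame)
    return np.concatenate([video_pos, sketch_pos])

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles: position times inverse frequency."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

# 5 latent frames, 3 tokens per frame, sketches targeting frames 0 and 4.
positions = build_temporal_positions(5, 3, sketch_frames=[0, 4])
angles = rope_angles(positions, dim=8)
```

In this sketch the tokens of the sketch for frame 4 receive exactly the same rotation angles as the video tokens of frame 4, so attention can associate them by position even though they sit at the end of the sequence.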

Weaknesses

* **Using DiT is not novel:** Just replacing a UNet-based diffusion model with a DiT-based model is not a significant gain, and I can't see any specification. I don't know why the authors bolded it. Lines 089-099.
* **SLRA is not novel:** Using a low-rank adapter for temporal or spatial adaptation is not novel, and many studies have used it for domain adaptation.
* **Sketch faithfulness is under-measured:** Current metrics (LPIPS/DISTS/CLIP) correlate with perceptual quality but not how well frames obey
