TL;DR
ToonComposer is a unified generative model that streamlines cartoon production by combining inbetweening and colorization into a single stage, reducing manual effort and enhancing control with sparse inputs.
Contribution
It introduces a novel unified framework with sparse sketch injection and domain adaptation techniques for efficient, flexible cartoon post-production.
Findings
Outperforms existing methods in visual quality and motion consistency.
Requires minimal input, such as a single sketch and reference frame.
Supports multiple sketches for precise motion control.
Abstract
Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference…
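As a rough illustration of the low-rank adaptation idea the abstract refers to, the sketch below shows a plain LoRA-style update on a single frozen weight matrix. This is the generic technique only, not the paper's spatial low-rank adapter, whose internals are not described on this page. The function name, shapes, and scaling factor are all assumptions made for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen linear layer W plus a trainable low-rank update (alpha / r) * B @ A.

    Generic LoRA sketch (hypothetical helper), not the paper's SLRA.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4          # illustrative sizes only
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(rank, d_in))      # trainable down-projection
B = np.zeros((d_out, rank))            # trainable up-projection, zero-initialised
x = rng.normal(size=(2, d_in))

y = lora_forward(x, W, A, B)
trainable = A.size + B.size            # parameters that adaptation would update
frozen = W.size                        # parameters left untouched
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices would be updated during adaptation.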
Peer Reviews
Decision: ICLR 2026 Poster
1. The sparse sketch injection mechanism enables precise temporal motion control with minimal sketch input. The inclusion of region-wise control further increases usability and flexibility, allowing users to input incomplete sketches for targeted generation.
2. The introduction of the spatial low-rank adapter (SLRA) is a significant technical contribution. Ablation experiments show that SLRA consistently outperforms standard LoRA and other adaptation baselines by effectively tailoring only the s
1. This is not the first DiT-based anime inbetweening and colorization paper. SketchColour[1] and AnimeColor[2] were also built on DiT-based models. Though they may be concurrent works, the authors should still not claim to be the first. Moreover, I personally don't recognize transferring a similar technique from UNet to DiT as a contribution.
2. The idea of optimizing only spatial attention is not novel either. ToonCrafter[3] already found that finetuning spatial layers only wor
- The core contribution is the unification of sketch colorization and interpolation into a single, cohesive task. This approach has the potential to significantly streamline the anime production workflow.
- The curation of the large-scale PKData dataset and the development of the PKBench benchmark, which uniquely includes human-drawn sketches, are valuable resources. They facilitate robust training and a more rigorous evaluation of cartoon generation models than previously possible.
- The pape
- The paper uses sparse sketches and color references as simultaneous control conditions. However, the concept of using multiple controls is not entirely novel and has been explored in prior works (e.g., LayerAnimate[1]). The authors should more explicitly discuss the unique advantages of their specific approach in the animation workflow.
- Temporal Alignment: The VAE in Wan employs a 4x temporal compression, meaning each latent token encapsulates information from multiple frames. The paper sta
* Introduces a single-stage post-keyframing formulation that unifies inbetweening and colorization, reducing the error accumulation inherent to two-stage pipelines and enabling control from as little as one sketch + one colored frame.
* Proposes sparse sketch injection with position-aware residual and positional-encoding mapping (sketch tokens appended to the latent token sequence with remapped RoPE to target specific frame indices); this is a clean, DiT-native control mechanism distinct from ch
* **Using DiT is not novel:** Simply replacing a UNet-based diffusion model with a DiT-based one is not a significant gain, and I can't see any specification. I don't know why the authors bolded it (lines 089-099).
* **SLRA is not novel:** Using a low-rank adapter for temporal or spatial adaptation is not novel, and many studies have used it for domain adaptation.
* **Sketch faithfulness is under-measured:** Current metrics (LPIPS/DISTS/CLIP) correlate with perceptual quality but not how well frames obey
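One review above describes sketch tokens being appended to the latent token sequence with RoPE remapped to target specific frame indices. The sketch below illustrates that general idea under simplifying assumptions: a 1D temporal rotary embedding only, NumPy instead of a real DiT, and hypothetical helper names. It is not the authors' implementation, and it does not cover how pixel frames map to latent frame indices.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for a 1D rotary embedding, shape (len(pos), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, inv_freq)

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by position-dependent angles."""
    theta = rope_angles(pos, x.shape[-1], base)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def temporal_positions(num_latent_frames, tokens_per_frame, sketch_frames):
    """Temporal position ids for video tokens followed by appended sketch tokens.

    Each appended sketch block reuses the index of the latent frame it should
    control, rather than a position past the end of the video (assumed scheme).
    """
    video_pos = np.repeat(np.arange(num_latent_frames), tokens_per_frame)
    sketch_pos = np.repeat(np.asarray(sketch_frames), tokens_per_frame)
    return np.concatenate([video_pos, sketch_pos])

# Illustrative sizes: 5 latent frames, 3 tokens each, sketches for frames 0 and 4.
num_latent_frames, tokens_per_frame, dim = 5, 3, 8
pos = temporal_positions(num_latent_frames, tokens_per_frame, sketch_frames=[0, 4])
tokens = np.ones((len(pos), dim))      # placeholder token features
rotated = apply_rope(tokens, pos)
```

In this toy setup, a sketch token aimed at latent frame 4 receives the same rotary phase as the video tokens of frame 4, which is what lets attention associate the two regardless of where the sketch tokens sit in the sequence.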
