InstanceAnimator: Multi-Instance Sketch Video Colorization
Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin, Qifeng Chen, Anyi Rao, Zeyu Wang

TL;DR
InstanceAnimator introduces a diffusion transformer framework that significantly improves multi-instance sketch video colorization by enhancing user control, instance alignment, and detail fidelity through novel mechanisms.
Contribution
The paper presents three key innovations—Canvas Guidance Condition, Instance Matching Mechanism, and Adaptive Decoupled Control Module—that collectively advance multi-instance sketch video colorization.
Findings
Achieves superior multi-instance colorization quality.
Enhances user control and flexibility.
Ensures high instance consistency and detail fidelity.
Abstract
We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from…
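The "free placement of reference elements and background" described for the Canvas Guidance Condition can be pictured with a small sketch. This is not the authors' implementation: the function name `compose_canvas`, the RGBA crop format, and the use of alpha blending are all assumptions made for illustration, and the abstract does not say how the canvas is actually built or encoded for the model.

```python
import numpy as np


def compose_canvas(background, instances, placements):
    """Paste instance crops onto a background at user-chosen positions.

    Hypothetical helper, not taken from the paper.
    background: (H, W, 3) uint8 image.
    instances:  list of (h, w, 4) uint8 RGBA crops; alpha marks the instance.
    placements: list of (top, left) pixel offsets, one per instance.
    """
    canvas = background.astype(np.float32)  # astype returns a new array
    H, W = canvas.shape[:2]
    for crop, (top, left) in zip(instances, placements):
        h, w = crop.shape[:2]
        # Clip the paste region to the canvas bounds.
        y0, x0 = max(top, 0), max(left, 0)
        y1, x1 = min(top + h, H), min(left + w, W)
        if y0 >= y1 or x0 >= x1:
            continue  # crop lies entirely outside the canvas
        sub = crop[y0 - top:y1 - top, x0 - left:x1 - left].astype(np.float32)
        alpha = sub[..., 3:4] / 255.0
        region = canvas[y0:y1, x0:x1]
        # Standard alpha blend of the crop over the existing canvas content.
        canvas[y0:y1, x0:x1] = alpha * sub[..., :3] + (1.0 - alpha) * region
    return np.round(canvas).astype(np.uint8)
```

In this sketch the user controls layout only through `placements`, so later crops overwrite earlier ones where they overlap. Whether the real method resolves overlaps, or ties canvas position to output position at all, is not stated in the text above.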
Peer Reviews
Decision · Submitted to ICLR 2026
- The problem setting, multi-instance sketch video colorization, is interesting and relevant for creative AI and animation generation.
- The overall pipeline is clearly presented, with detailed ablations showing the contribution of each module.
- The qualitative results demonstrate visually appealing outputs with good color consistency and controllability.
- The proposed Instance Matching mechanism seems to only establish random associations between sketches and instances, without any explicit spatial alignment. While the Canvas Guidance provides a weak positional prior, it does not guarantee that instances placed on the canvas will appear at the intended locations in the generated sequence. What if users need to swap two characters? This limits the controllability for professional animation use cases.
- The quantitative improvement over strong b
The paper demonstrates a multi-instance colorization method for sketch videos, extending single-frame reference colorization to enable flexible, instance-based control that aligns with anime production workflows. The technical design is reasonable, including the canvas guidance and decoupled control module that effectively address identified gaps in DiT-based models. The motivation, method descriptions, and figures are easy to follow. This work has value in reducing the extensive human labor in an
1. While the claims are generally supported by numerical results, the sketch fidelity remains unsatisfactory in some samples. For example, in the teaser figure, the model output clearly does not follow the sketches (2nd row: Chihiro's face does not follow the sketch in the 3rd frame; 4th row: the girl's mouth does not follow the sketch in the 3rd to 5th frames). Since sketch fidelity is essential in video colorization and anime production, this defect is unsatisfactory in a video sketch
- The primary contribution is shifting the animation conditioning from complete frames to the instance level. This approach is more user-friendly and has the potential to simplify the practical animation workflow.
- The paper is easy to follow.
- The experimental results effectively demonstrate that InstanceAnimator can produce high-fidelity video results.
- The description of "instance matching" in Section 3.3 is ambiguous. Equation 2 introduces instance-specific latent features ($Z_{\text{inst}}^i$), but the paper fails to explain how these features are utilized or injected into the network. The methodology is difficult to understand without Figure 4, indicating a need for significant improvement in the clarity of the writing.
- Insufficient Instance Correspondence Mechanism: A more critical issue is the lack of a clear explanation for how insta
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Computer Graphics and Visualization Techniques
