Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum

TL;DR
ColorCtrl is a training-free method leveraging Multi-Modal Diffusion Transformers to achieve precise, consistent, and region-specific color editing in images and videos guided by text prompts.
Contribution
We introduce ColorCtrl, a novel training-free color editing technique that manipulates attention maps in diffusion transformers for accurate, consistent, and controllable color modifications.
Findings
Outperforms existing training-free methods in quality and consistency
Achieves state-of-the-art results with the SD3 and FLUX.1-dev base models
Maintains temporal coherence and stability in video editing
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method…
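As a rough illustration of the attention maps the abstract refers to, the sketch below computes joint attention over a concatenated text-and-image token sequence and slices the result into four blocks. It is a toy under stated assumptions, not the authors' implementation: the function names `joint_attention` and `split_quadrants` are hypothetical, text tokens are assumed to come first in the sequence, rows are taken as queries and columns as keys, and a single head without learned projections is used.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)) over the concatenated sequence."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def split_quadrants(attn, n_text):
    """Slice a joint attention matrix into four blocks.

    Assumption: the first n_text tokens are text, the rest are image tokens.
    Naming convention (also an assumption): "<query side>_to_<key side>".
    """
    return {
        "text_to_text": [row[:n_text] for row in attn[:n_text]],
        "text_to_vision": [row[n_text:] for row in attn[:n_text]],
        "vision_to_text": [row[:n_text] for row in attn[n_text:]],
        "vision_to_vision": [row[n_text:] for row in attn[n_text:]],
    }


# Toy sequence: 2 text tokens followed by 3 image tokens, feature dimension 2.
n_text = 2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.2]]
attn = joint_attention(tokens, tokens)
quadrants = split_quadrants(attn, n_text)
```

In this toy, `quadrants["vision_to_vision"]` is a 3×3 block and `quadrants["vision_to_text"]` is a 3×2 block. A training-free editor would read or overwrite such blocks during sampling rather than recompute them from scratch.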
Peer Reviews
Decision: ICLR 2026 Poster
- The proposed method is designed specifically for color editing and excludes other factors, achieving clean color edits with faithfully preserved content.
- The proposed method and benchmark highlight the importance of producing plausible color edits under similar lighting and other environmental conditions, rather than relying on traditional semantic CLIP alignment, making the outputs more realistic and natural.
- [Major] The only editing-quality metric is still CLIP similarity, which does not align with the claim that CLIP tends to prefer over-saturated results because it lacks detail. The main upgrade to the metrics focuses only on structure preservation. Given the focus of the paper, improved editing metrics are needed to solidify the evaluation. For example, aesthetics or color harmony could at least be tested to show that the edited colors fit the environment well. Maybe other color spaces like HSV could
1. Originality: The idea of using MM-DiT attention maps for color-specific editing is novel in the context of training-free pipelines. The integration of word-level control adds granularity not commonly seen in prior works.
2. Quality: The method is well-implemented and evaluated across multiple domains (images, videos, instructions). Results show high semantic consistency and localized edits, outperforming several baselines.
3. Clarity: The paper is generally well-written, with clear motivation
1. Limited Novelty: While the use of MM-DiT is novel for color editing, similar pipelines have applied attention-based editing in other contexts. TextCrafter also leverages attention maps from MM-DiT to extract semantic masks and reweight attention for image editing. Add-It also leverages attention maps from MM-DiT to extract semantic masks.
2. Insufficient Analysis of Key Components: (1) The mask extraction process is under-explained: How are attention maps selected? What thresholding strategy
* Originality: The paper adapts attention-control editing to MM-DiT with a clear decomposition of attention quadrants: vision-to-vision for structure preservation, vision-to-text for mask extraction, and text-to-vision for controllable attribute strength. This differs from U-Net cross-attention methods and prior MM-DiT controls (e.g., DiTCtrl) by operating directly on attention maps and value-token routing without training.
* Quality: The mechanism is well specified: two-branch unrolling wit
* Masking and subject detection reliance: The evaluation and parts of the pipeline hinge on subject keywords and a fixed attention-threshold ($\epsilon=0.1$) for mask extraction; robustness to threshold choice, ambiguous subject words, or multi-object scenes is not deeply analyzed.
* Claims versus limitations: The paper acknowledges failures when the base model mislocalizes targets or confuses attributes (e.g., trees or lipstick casing). More systematic characterization of such failure modes—
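To make the threshold-robustness concern above concrete, here is a minimal sketch of turning per-image-token attention toward a subject word into a binary mask. Only the threshold value $\epsilon=0.1$ comes from the review text. The helper name `extract_mask`, the min-max normalization step, and the toy scores are assumptions made for illustration and may differ from the paper's actual procedure.

```python
def extract_mask(subject_attention, epsilon=0.1):
    """Binary mask over image tokens from attention toward a subject word.

    Assumption: scores are min-max normalized to [0, 1] before thresholding;
    the paper may normalize or aggregate differently.
    """
    lo, hi = min(subject_attention), max(subject_attention)
    span = hi - lo
    if span == 0:
        # Flat attention carries no localization signal, so select nothing.
        return [False] * len(subject_attention)
    return [(a - lo) / span > epsilon for a in subject_attention]


# Hypothetical attention of six image tokens toward one subject word.
scores = [0.02, 0.03, 0.40, 0.35, 0.05, 0.02]

mask = extract_mask(scores)                      # epsilon = 0.1
mask_loose = extract_mask(scores, epsilon=0.05)  # lower threshold

print(mask)        # [False, False, True, True, False, False]
print(mask_loose)  # [False, False, True, True, True, False]
```

In this toy, lowering the threshold from 0.1 to 0.05 pulls a weakly attended token into the mask, which is the kind of sensitivity the reviewer asks the authors to analyze.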
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Enhancement Techniques
