ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai

TL;DR
ConsistEdit introduces a training-free attention control method for MM-DiT that achieves highly consistent, precise, and flexible visual editing across images and videos, outperforming prior approaches in multi-round and multi-region editing.
Contribution
The paper presents ConsistEdit, a novel attention control technique tailored for MM-DiT, enabling reliable, fine-grained, and multi-step visual editing without manual tuning.
Findings
Achieves state-of-the-art performance in image and video editing tasks.
Supports multi-round and multi-region editing with high consistency.
Enables progressive control of structural consistency.
Abstract
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods…
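To make the idea of training-free attention control concrete, here is a minimal NumPy sketch of the general family of techniques the abstract refers to: during the edit pass, queries and keys from the source pass are reused for tokens whose structure should be preserved, while values come from the edit pass. This is a generic illustration under stated assumptions, not ConsistEdit's specific procedure, which the truncated abstract does not describe. The names `controlled_attention` and `preserve_mask`, and the choice to swap only queries and keys, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain single-head scaled dot-product attention; inputs are (tokens, dim).
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v


def controlled_attention(src, tgt, preserve_mask):
    """Hypothetical attention-control step (an assumption, not the paper's API).

    src, tgt: dicts with "q", "k", "v" arrays of shape (tokens, dim) taken
        from the source pass and the edit pass respectively.
    preserve_mask: boolean array of shape (tokens,); True marks tokens whose
        structure should follow the source.
    """
    m = preserve_mask[:, None]
    # Reuse source queries/keys where structure is preserved.
    q = np.where(m, src["q"], tgt["q"])
    k = np.where(m, src["k"], tgt["k"])
    # Values always come from the edit pass so new appearance can flow in.
    return attention(q, k, tgt["v"])


# Usage with random stand-in features (6 tokens, 4 channels).
rng = np.random.default_rng(0)
src = {name: rng.normal(size=(6, 4)) for name in "qkv"}
tgt = {name: rng.normal(size=(6, 4)) for name in "qkv"}
mask = np.array([True, True, True, False, False, False])
out = controlled_attention(src, tgt, mask)
print(out.shape)
```

Because the mask is applied per token, a sketch like this can confine preservation to one region, which is the kind of region-level control the abstract says globally enforced consistency lacks.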
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
