ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Zixin Yin; Ling-Hao Chen; Lionel Ni; Xili Dai

arXiv:2510.17803 · cs.CV · October 21, 2025


Open Access

TL;DR

ConsistEdit introduces a training-free attention control method for MM-DiT that achieves highly consistent, precise, and flexible visual editing across images and videos, outperforming prior approaches in multi-round and multi-region editing.

Contribution

The paper presents ConsistEdit, a novel attention control technique tailored for MM-DiT, enabling reliable, fine-grained, and multi-step visual editing without manual tuning.

Findings

01 Achieves state-of-the-art performance in image and video editing tasks.

02 Supports multi-round and multi-region editing with high consistency.

03 Enables progressive control of structural consistency.

Abstract

Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods…
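
To make the idea of training-free attention control concrete, the following is a minimal generic sketch, not the paper's actual method. It assumes a joint attention layer over a sequence laid out as [text tokens | vision tokens], and it assumes that control means copying selected projections (Q, K, or V) of the vision tokens from a cached source pass into the editing pass. The function names, the `inject` parameter, and the choice of which projections to copy are all hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def joint_attention_with_injection(edit, source, n_text, inject=("q", "k")):
    """Hypothetical attention control for a joint text+vision sequence.

    `edit` and `source` are dicts with keys 'q', 'k', 'v', each a list of
    token vectors ordered as [text tokens | vision tokens]. For every
    projection named in `inject`, the vision-token entries are taken from
    the source pass; text tokens always come from the edit pass.
    (Assumption: this injection rule is illustrative only.)
    """
    mixed = {}
    for name in ("q", "k", "v"):
        if name in inject:
            mixed[name] = edit[name][:n_text] + source[name][n_text:]
        else:
            mixed[name] = edit[name]
    return attention(mixed["q"], mixed["k"], mixed["v"])


# Toy data: 1 text token followed by 2 vision tokens, dimension 2.
source = {
    "q": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "k": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    "v": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
}
edit = {
    "q": [[0.0, 1.0], [1.0, 0.5], [0.2, 0.8]],
    "k": [[0.0, 1.0], [1.0, 0.0], [0.3, 0.7]],
    "v": [[2.0, 1.0], [4.0, 3.0], [6.0, 5.0]],
}

# Structure (Q, K) of vision tokens follows the source; values follow the edit.
edited = joint_attention_with_injection(edit, source, n_text=1)
```

With `inject=()` the layer reduces to ordinary attention on the editing pass, and with all three projections injected the vision tokens are taken entirely from the source. Varying that choice per layer, timestep, or region is one plausible way to trade editing strength against consistency, though the source text here does not specify how ConsistEdit does so.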

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis