CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing
Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, Nevin L. Zhang

TL;DR
CannyEdit is a training-free image editing framework that combines selective structural guidance with dual-prompt guidance to improve text adherence, context fidelity, and seamlessness in region-specific edits, even with minimal user input such as rough masks or single-point hints.
Contribution
CannyEdit introduces a training-free approach that combines Selective Canny Control and Dual-Prompt Guidance to improve control and flexibility in image editing.
Findings
Outperforms existing region-based editing methods in text adherence and seamlessness.
Effective with rough masks or single-point hints for additional tasks.
Seamlessly integrates with vision-language models for complex instruction-based editing.
Abstract
Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses this trilemma through two key innovations. First, Selective Canny Control applies structural guidance from a Canny ControlNet only to the unedited regions, preserving the original image's details while allowing for precise, text-driven changes in the specified editable area. Second, Dual-Prompt Guidance utilizes both a local prompt for the specific edit and a global prompt for overall scene coherence. Through this synergistic approach, these components enable controllable local editing for object addition,…
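The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `selective_control` and `dual_prompt_attention_mask`, the tensor shapes, and the `strength` parameter are hypothetical, and the attention-mask layout for the two prompts is an assumption, since the summary only states that a local prompt drives the edit and a global prompt maintains scene coherence.

```python
import numpy as np


def selective_control(hidden, control_residual, edit_mask, strength=1.0):
    """Add a ControlNet-style residual only on unedited pixels.

    hidden, control_residual: arrays of shape (H, W, C)
    edit_mask: array of shape (H, W), 1 inside the editable region, 0 elsewhere
    (Hypothetical helper; shapes and names are illustrative assumptions.)
    """
    # keep == 1 on unedited pixels, 0 inside the editable region
    keep = (1.0 - edit_mask.astype(hidden.dtype))[..., None]
    return hidden + strength * keep * control_residual


def dual_prompt_attention_mask(edit_mask, n_local, n_global):
    """Boolean (image_tokens, text_tokens) mask; True means attention is allowed.

    Assumed layout: local-prompt tokens are visible only to image tokens inside
    the edit region, while global-prompt tokens are visible to every image token.
    """
    inside = edit_mask.reshape(-1).astype(bool)              # (H*W,)
    local = np.repeat(inside[:, None], n_local, axis=1)      # (H*W, n_local)
    global_ = np.ones((inside.size, n_global), dtype=bool)   # (H*W, n_global)
    return np.concatenate([local, global_], axis=1)


# Toy example: a 4x4 feature map with a 2x2 editable square in the middle.
hidden = np.zeros((4, 4, 2))
residual = np.ones((4, 4, 2))
edit_mask = np.zeros((4, 4))
edit_mask[1:3, 1:3] = 1

out = selective_control(hidden, residual, edit_mask)
attn = dual_prompt_attention_mask(edit_mask, n_local=3, n_global=5)
```

In this toy setup the structural residual reaches only the twelve pixels outside the editable square, so the edit region is left free to follow the text prompt, and only the four image tokens inside the square can attend to the local prompt.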
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
### Originality
1. Introduces a clear and well-motivated training-free editing framework addressing the editability-fidelity-seamlessness trade-off.
2. Overall originality is moderate: innovation lies more in integration and careful design than in novel algorithms.

### Quality
1. Strong empirical validation on both mask-based and instruction-based setups.
2. Selective Canny Control effectively preserves background structure while allowing flexible local edits.
3. Dual-Prompt Guidance improves t
1. Limited Novelty of Mechanisms. The two key modules, selective structural control and multi-prompt attention, mainly extend existing ControlNet and attention-masking strategies rather than introducing fundamentally new formulations. In particular, **Selective Canny Control** is conceptually similar to the edge-based structural guidance used in **MagicQuill [1] (Sec. 3.1, Para. 1)**, but the paper does not explicitly clarify how it differs.
2. Efficiency Unclear. Although the method is “training-fr
- The method requires no fine-tuning of the base diffusion model (i.e., FLUX.1-dev) and works with existing Canny-based ControlNets, making integration straightforward and model-agnostic.
- During denoising, Canny ControlNet feature maps are injected only into non-target pixels, stabilizing layout and preventing unintended changes in unedited regions.
- Dual (local/global) prompts and a VLM+SAM pipeline that converts point hints into accurate masks enable precise control and natural extension to
**1. Minor Novelty of CannyEdit.** Despite using provided editing masks (or VLM+SAM–refined masks), background preservation lags far behind prior art: e.g., LPIPS (Appendix Tab. 4) shows KV-Edit 9.92 vs. CannyEdit 26.38, indicating that simply mixing ControlNet features into non-target pixels is insufficient to protect unedited regions. If the method’s core claim is “Selective Canny Control preserves the original structure,” then background-fidelity metrics must be strong in the main tables; mov
- The proposed editing method can precisely locate editing regions, support more flexible editing operations, and deliver highly faithful generation results.
- The designed approach effectively preserves the unedited regions of the image.
- The paper is clearly written, with well-presented comparative results and professionally crafted figures.
- The method appears to be heavily engineered, with the overall image generation process resembling a combination of null-text inversion, ControlNet, and FLUX.
- Given the involvement of multiple models, the inference speed requires clarification through detailed runtime analysis.
- The current strategy may struggle with editing requests involving significant spatial transformations, such as shifting the viewpoint by 60 degrees or making objects "fly" in the image. The authors should address how
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Multimodal Machine Learning Applications
