Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers
Shuo Zhang, Wenzhuo Wu, Huayu Zhang, Jiarong Cheng, Xianghao Zang, Chao Ban, Hao Sun, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

TL;DR
GeoEdit is a diffusion transformer framework for geometric image editing. It handles translation, rotation, and scaling accurately, models lighting and shadow effects more realistically, is trained on a large-scale dataset, and outperforms existing methods.
Contribution
The paper introduces GeoEdit, a diffusion transformer-based approach with Effects-Sensitive Attention for precise geometric edits and realistic lighting, along with the RS-Objects dataset for training.
Findings
Outperforms state-of-the-art methods in visual quality and realism
Achieves accurate geometric transformations in complex scenes
Enhances lighting and shadow modeling for realistic results
Abstract
Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric…
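The three edits named in the abstract (translation, rotation, and scaling) can each be written as a 2D affine map. The following is a minimal Python sketch of that standard construction, not GeoEdit's implementation. The function names `affine_matrix` and `apply_affine`, the choice to rotate and scale about an object center, and the example box are all assumptions made for illustration.

```python
import math

def affine_matrix(tx=0.0, ty=0.0, angle_deg=0.0, scale=1.0, center=(0.0, 0.0)):
    """Return a 3x3 homogeneous matrix for p' = s * R * (p - c) + c + t.

    Hypothetical helper: scales and rotates about `center`, then translates.
    In image coordinates (y pointing down) a positive angle looks clockwise.
    """
    cx, cy = center
    theta = math.radians(angle_deg)
    a = scale * math.cos(theta)
    b = scale * math.sin(theta)
    return [
        [a, -b, cx - a * cx + b * cy + tx],
        [b, a, cy - b * cx - a * cy + ty],
        [0.0, 0.0, 1.0],
    ]

def apply_affine(matrix, points):
    """Apply a 3x3 homogeneous affine matrix to a list of (x, y) points."""
    out = []
    for x, y in points:
        new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
        new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
        out.append((new_x, new_y))
    return out

# Illustrative object bounding box (assumed values, not from the paper).
box = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

# Rotate the box 90 degrees about its center and shift it 10 units along x.
m = affine_matrix(tx=10.0, ty=0.0, angle_deg=90.0, scale=1.0, center=(1.0, 1.0))
moved = apply_affine(m, box)
```

Applying the same matrix to every pixel coordinate of an object mask would give the target region for the edit. How GeoEdit itself feeds the transformed object into the diffusion transformer is not specified in the text above.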
Peer Reviews
Decision·ICLR 2026 Poster
1. Clear motivation and problem definition – focuses on geometric (translation / rotation / scaling) image editing, which remains under-explored compared to semantic or text-guided edits.
2. Comprehensive dataset pipeline – the RS-Objects dataset seems carefully designed (render + AIGC + human), addressing a genuine data gap.
3. Strong experiments – covers both 2D and 3D edits, reports seven metrics, and includes ablations and a user study.
4. Readable writing and solid figures.
1. Dataset authenticity and reproducibility. It is unclear whether any real photographs with ground-truth geometric edits exist for validation. Public release status is also unclear.
2. Many baselines are test-time or training-free, whereas the proposed method is training-based with a large custom dataset, so the comparison is not entirely apples-to-apples. How much overhead does ESA add versus standard DiT inpainting?
3. Limitations are not discussed: what can GeoEdit not achieve? Discussing these aspects i
1. The paper presents a clearly structured geometric editing pipeline with well-defined, reproducible steps. Each transformation (translation, rotation, and scaling) is handled through explicit procedures. The detailed description of these steps provides strong methodological clarity and makes the approach readily reproducible.
2. The proposed RS-Objects dataset is thoughtfully designed to align with the objectives of geometric image editing. It employs a two-stage rendering-to-synthesis pipeline
1. Theoretical Contribution and Clarity of ESA. Theorem 3.1 offers a limited theoretical contribution. Since the hard-modulated attention focuses exclusively on insertion tokens, its KL divergence from the ideal attention map is trivially infinite, and showing that the ESA variant achieves a smaller divergence is therefore not a particularly informative result. The theorem appears unnecessary in its current form. The central issue is not the inequality itself, but the rationale behind defining th
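The reviewer's remark that the hard-modulated divergence is "trivially infinite" can be checked with a small numeric sketch. This is an illustration under stated assumptions, not the paper's Theorem 3.1: the divergence direction KL(ideal || approximation), the four-token attention row, the `ideal` weights, and the additive `soft_bias` used as a stand-in for a soft variant are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax that tolerates -inf logits (they map to 0)."""
    finite_max = max(z for z in logits if z != -math.inf)
    exps = [math.exp(z - finite_max) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q); infinite as soon as q is zero where p has mass."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

# Assumed toy setup: two insertion tokens, then two "effect" tokens
# (e.g. shadow regions) that the ideal attention also attends to.
ideal = [0.4, 0.4, 0.1, 0.1]
is_insertion = [True, True, False, False]
logits = [0.0, 0.0, 0.0, 0.0]

# Hard modulation: non-insertion tokens are masked out entirely.
hard = softmax([z if ins else -math.inf for z, ins in zip(logits, is_insertion)])

# Soft stand-in: non-insertion tokens are only down-weighted by a finite bias.
soft_bias = -2.0
soft = softmax([z if ins else z + soft_bias for z, ins in zip(logits, is_insertion)])

kl_hard = kl_divergence(ideal, hard)
kl_soft = kl_divergence(ideal, soft)
```

Because the hard map assigns exactly zero weight to tokens where the ideal map has mass, `kl_hard` is infinite for any such ideal map, while any finite bias keeps `kl_soft` finite. This is why the reviewer considers the inequality uninformative.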
* Well-written and structured paper with clear motivation and consistent organization.
* Strong experimental performance on both quantitative and qualitative metrics across multiple tasks.
* Comprehensive dataset (RS-Objects) with rigorous construction and filtering criteria; could benefit the broader community.
* Practical application value in realistic scene editing and geometric transformation control.
* The proposed ESA module appears conceptually similar to a soft attention bias, and its correlation with lighting effects is not convincingly demonstrated.
* The method’s heavy reliance on external geometry models (e.g., Hunyuan-3D) for 3D reconstruction compromises its originality and self-containment. It also remains unclear how GeoEdit performs when alternative geometry backbones are employed.
* The model is built upon the Flux backbone, while several baselines are not, raising concerns ab
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis
