TL;DR
R-Genie introduces an image editing approach that combines diffusion models with the reasoning capabilities of multimodal large language models, enabling complex, context-aware image modifications based on multi-faceted textual queries.
Contribution
The paper presents R-Genie, a framework that integrates reasoning with generative image editing, supported by a dataset of over 1,000 image-instruction-edit triples and a reasoning-attention mechanism for improved understanding and editing.
Findings
R-Genie effectively handles complex, reasoning-based image editing tasks.
The dataset enables training models with rich reasoning and real-world knowledge.
Experimental results show enhanced editing capabilities with reasoning integration.
Abstract
While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries accepting world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper presents a technically complete framework combining diffusion models and MLLMs, reflecting a solid understanding of both reasoning and generative modeling. 2. The introduction of a new dataset (REditBench) focusing on reasoning-based image edits is valuable for benchmarking future research in this area. 3. The experimental evaluation is comprehensive, including quantitative comparisons, ablation studies, and user studies, which enhance the credibility of the results.
1. Motivation is unconvincing. The paper argues that implicit user intentions should be inferred by the model, but it is unclear why this is necessary for image editing. In practice, users could simply provide explicit, straightforward editing instructions; forcing the model to infer “hidden” intentions may not be a meaningful or realistic goal. In addition, Figure 1 is not intuitive and lacks side-by-side comparisons with existing instruction formats, which would make the contribution clearer.
The task that the paper addresses is meaningful as it enables reasoning-based, complex image editing.
1. Limited scale of the dataset. The paper proposes a new benchmark, REditBench, with 1,070 images (850/220 train/val split). The training set is too small for training, especially when using the contrastive objective in Eqn 6. 2. Limited evaluations. The authors only showed comparisons on REditBench, while leaving out more commonly used benchmarks such as MagicBrush, Emu, and SmartEdit. The proposed method likely overfits the proposed training dataset, resulting in better evaluation resu
1. The core contribution is combining the reasoning capabilities of MLLMs with diffusion models, allowing complex edits that require reasoning and world knowledge. 2. Dataset contribution: the paper proposes REditBench, a new dataset focused on reasoning-intensive edits. 3. The paper introduces a hierarchical reasoning module and a reasoning-attention bridge to enable precise and rational instruction-based image editing.
1. REditBench is relatively small compared to typical image editing datasets. It would be better to include more open-world images and instructions. 2. Limited instruction diversity: most examples shown in the paper focus on attribute reasoning and compositional edits, e.g. replace, move. 3. Lack of cross-dataset evaluation. All main results are reported on REditBench; performance on other widely adopted editing datasets (e.g. GEdit and ImgEdit) would improve external validity.
