Interaction-Consistent Object Removal via MLLM-Based Reasoning
Ching-Kai Huang, Wen-Chieh Lin, Yan-Cen Lee

TL;DR
This paper introduces a new framework called REORM that uses multimodal large language models to perform interaction-consistent object removal in images, addressing the challenge of removing related interaction elements for more semantically coherent results.
Contribution
The paper formalizes the Interaction-Consistent Object Removal problem and proposes REORM, a novel reasoning-enhanced framework leveraging MLLMs for joint removal of objects and interaction elements.
Findings
REORM outperforms existing image editing systems on the ICOREval benchmark.
REORM effectively removes interaction elements, ensuring semantic consistency.
The modular design of REORM enables accurate editing even with limited resources.
Abstract
Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, target-produced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoning-enhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support…
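The three stages named in the abstract (MLLM-driven analysis, mask-guided removal, self-correction) can be pictured as a small loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: every name in it (`remove_with_interactions`, `Element`, the `toy_*` stand-ins, the `max_rounds` stopping rule) is hypothetical, and the stages are passed in as plain callables so that a real MLLM, segmenter, and inpainting model could be substituted. The toy "image" is just a dictionary from pixel to label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]            # pixels to erase
Image = Dict[Pixel, str]     # toy image: pixel -> label of what is drawn there

# The four interaction-element types listed in the abstract.
CATEGORIES = (
    "lighting-dependent effect",
    "physically connected object",
    "target-produced element",
    "contextually linked object",
)

@dataclass(frozen=True)
class Element:
    name: str
    category: str  # one of CATEGORIES, or "target"

def remove_with_interactions(
    image: Image,
    target: str,
    analyze: Callable[[Image, str], List[Element]],   # stands in for MLLM analysis
    segment: Callable[[Image, Element], Mask],        # stands in for a segmenter
    inpaint: Callable[[Image, Mask], Image],          # stands in for a removal model
    verify: Callable[[Image, str], List[Element]],    # stands in for self-correction
    max_rounds: int = 2,                              # assumed stopping rule
) -> Tuple[Image, List[Element]]:
    """Remove the target plus whatever the analysis says must go with it."""
    elements = [Element(target, "target")] + analyze(image, target)
    result, removed = image, []
    for _ in range(max_rounds):
        # Mask-guided removal: erase the union of all element masks at once.
        mask: Mask = set()
        for el in elements:
            mask |= segment(result, el)
        result = inpaint(result, mask)
        removed.extend(elements)
        # Self-correction: ask what interaction evidence is still visible.
        elements = verify(result, target)
        if not elements:
            break
    return result, removed

# --- Toy stand-ins, for demonstration only ---
def toy_analyze(image, target):
    # Pretend the first analysis finds the shadow but misses the leash.
    return [Element("shadow", CATEGORIES[0])]

def toy_segment(image, el):
    return {p for p, label in image.items() if label == el.name}

def toy_inpaint(image, mask):
    return {p: ("background" if p in mask else label) for p, label in image.items()}

def toy_verify(image, target):
    # Pretend the check spots a leftover leash if one is still drawn.
    if "leash" in image.values():
        return [Element("leash", CATEGORIES[1])]
    return []

scene = {(0, 0): "dog", (0, 1): "leash", (1, 0): "shadow", (1, 1): "background"}
edited, removed = remove_with_interactions(
    scene, "dog", toy_analyze, toy_segment, toy_inpaint, toy_verify
)
print([e.name for e in removed])  # elements erased across both rounds
```

In this toy run the first round erases the dog and its shadow, and the verification step then flags the leash for a second round. Whether REORM iterates in this way, and how many rounds it allows, is not stated in the text above.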
Peer Reviews
Decision · Submitted to ICLR 2026
### Originality
Defines a new task: Interaction-Consistent Object Removal, emphasizing removal of both objects and their interaction elements. The introduction of Interaction-Consistent Object Removal extends classical object removal to a more semantically complete problem, emphasizing real-world plausibility.
### Quality
Technically solid modular design combining reasoning, segmentation, and diffusion-based inpainting.
### Clarity
The paper is clearly structured with well-designed figures exp
1. While the task definition is new, the core framework mainly orchestrates existing tools (GPT-4o, Grounded-SAM, ObjectClear). The contribution lies more in integration and reasoning design than in fundamental algorithmic innovation.
2. The use of MLLM reasoning to guide inpainting is not entirely novel; similar techniques appear in BrushEdit [1] and MagicQuill [2].
3. The main version depends on closed APIs (GPT-4o), raising concerns about reproducibility, cost, and long-term accessibility.
1. The paper introduces a meaningful task, Interaction-Consistent Object Removal, extending traditional object removal to model semantic and physical interactions among scene elements.
2. The inclusion of a local deployment variant demonstrates practical value and thoughtful consideration of real-world constraints.
1. The ICOR task can be viewed as an extension of prior object-effect removal methods, expanding from handling only lighting-dependent effects to covering more types, such as physically connected objects, target-produced elements, and contextually linked objects. However, if paired training data were available, prior methods might also be able to handle these limited cases, which would limit the novelty.
2. The ICOREval dataset, while a contribution of the paper, includes only 72 examples, which
The paper clearly formulates a refined variant of the object removal task that accounts for both the target object and its surrounding contextual or interacting elements, such as shadows, reflections, and connected parts. This makes the problem definition more realistic and relevant to real-world editing scenarios. The authors build a complete end-to-end system and introduce a new benchmark, ICOREval, specifically designed to evaluate interaction-consistent removal quality. Experimental results
1. There seems to be a collateral deletion issue: in Fig. 1 (iii), for example, background objects are also removed, indicating erroneous behavior.
2. The error analysis is insufficient. It would be valuable to discuss whether error accumulation occurs when combining LLMs and MLLMs.
3. The method has not been evaluated on existing benchmark datasets, such as MagicBrush, which limits the fairness and completeness of the comparison.
4. It is recommended to include inversion-bas
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Computer Graphics and Visualization Techniques
