EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal

Sanghyun Jo; Donghwan Lee; Eunji Jung; Seong Je Oh; Kyungsu Kim

arXiv:2512.21545·cs.CV·December 29, 2025

Open Access · 3 Reviews

TL;DR

EraseLoRA introduces a dataset-free, background-aware framework for object removal that improves background reconstruction and avoids unwanted object reappearance by leveraging multimodal large language models and test-time adaptation.

Contribution

It proposes a novel background-aware reasoning and test-time optimization approach that enhances dataset-free object removal without explicit attention manipulation.

Findings

01

Outperforms existing dataset-free methods in object removal benchmarks.

02

Achieves results competitive with dataset-driven approaches.

03

Effectively preserves background details and structure.

Abstract

Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired…
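As a rough illustration of the three-way separation the abstract describes (target foreground, non-target foregrounds, clean background), the toy sketch below partitions pixels once masks are available. It is not the authors' implementation: the MLLM reasoning step is not modeled, the masks are assumed to be given as boolean grids, and `partition_regions` and its overlap rule are hypothetical choices made only for this example.

```python
from typing import Dict, List

Mask = List[List[bool]]


def partition_regions(target: Mask, non_target_fgs: List[Mask]) -> Dict[str, Mask]:
    """Split pixels into target, non-target foreground, and clean background.

    Assumptions (not from the paper): all masks share one height and width,
    and a pixel covered by both the target and a non-target mask is assigned
    to the target.
    """
    h, w = len(target), len(target[0])
    non_target = [[False] * w for _ in range(h)]
    # Union of all non-target foreground masks, excluding target pixels.
    for mask in non_target_fgs:
        for y in range(h):
            for x in range(w):
                if mask[y][x] and not target[y][x]:
                    non_target[y][x] = True
    # Clean background is whatever belongs to neither foreground group.
    background = [
        [not target[y][x] and not non_target[y][x] for x in range(w)]
        for y in range(h)
    ]
    return {"target": target, "non_target": non_target, "background": background}


# Hypothetical 2x3 example: one target mask and one other foreground mask.
target_mask = [[True, False, False], [False, False, False]]
other_mask = [[True, True, False], [False, False, False]]
regions = partition_regions(target_mask, [other_mask])
```

In this toy setting, only the pixels marked in `regions["background"]` would be treated as trustworthy background cues; how EraseLoRA actually obtains and uses these regions is described in the paper itself.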

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Use of MLLMs for foreground-background separation in object removal is novel.
2. Motivation seems convincing.
3. The method is agnostic to the base segmentation models.
4. Strong evaluation protocol.

Weaknesses

1. Gains over the SOTA are only marginal (Tab 2), and at times the method loses to dataset-driven methods.
2. Given these marginal gains, statistical significance isn't reported.
3. Very large computational cost: Tab 3 shows a significant increase in VRAM.
4. The method seems to be critically dependent on MLLMs. It is unclear what happens if and when MLLMs hallucinate objects.
5. Datasets (ROAD) are large enough, and metrics aren't representative enough.
6. There are a lot of missing

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The proposed EraseLoRA specifically addresses two issues in dataset-free object removal, unintended non-target foreground regeneration and fine-detail loss caused by disrupted short-range attention, through BFE (Background-aware Foreground Exclusion) and BSA (Background Subtype Aggregation).
2. Compared with directly constraining the attention map, the test-time adaptation method is more robust.

Weaknesses

1. **Methodological Limitation on Occluded Non-Target Foregrounds.** While suppressing non-target foregrounds effectively avoids their interference during target-region background generation, it also creates a limitation: when non-target foregrounds are occluded by the target foreground, the method cannot reconstruct those occluded parts. This flaw is clearly visible in the provided result figures.
2. **High Computational Overhead from MLLM Dependency.** Invoking an MLLM for each inference signi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The motivation of this submission is clear and the design aligns well with the motivation.
2. The paper is well-written and easy to follow.
3. The qualitative results demonstrated in this submission are impressive.

Weaknesses

1. Evaluation metrics. On OpenImages there is no ground-truth removed image, so the authors use F-DINO/F-CLIP and B-DINO/B-CLIP. That’s reasonable, but the community may want human evaluation or more visual-realism metrics. Currently the strongest claims rest on these proxy numbers.
2. Novelty is partly compositional. The proposed method is good engineering on top of existing tools rather than a brand-new learning principle, which might not be suitable for ICLR.
3. Robustness experiments on errors from too

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Visual Attention and Saliency Detection