RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Dewei Zhou; You Li; Zongxin Yang; Yi Yang

arXiv:2604.06870·cs.CV·April 9, 2026

TL;DR

RefineAnything is a multimodal diffusion-based model designed for precise local image refinement, maintaining background integrity while enhancing fine details within user-specified regions.

Contribution

The paper introduces a region-focused refinement strategy with a boundary-aware loss and constructs a new benchmark for evaluating local image editing quality.

Findings

01. Achieves near-perfect background preservation in local refinement tasks.

02. Outperforms baselines on the RefineEval benchmark.

03. Demonstrates the effectiveness of the crop-resize and mask-blending strategies.
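
The crop-resize and mask-blending idea named in the findings can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `work_size` and `pad` parameters, the nearest-neighbour resize, and the hard binary mask are all assumptions made for illustration, and `refine_fn` is a hypothetical stand-in for the diffusion-based refiner. The sketch crops a box around the user-specified region, enlarges it to a fixed working resolution, refines it, shrinks it back, and pastes it in through the mask so that pixels outside the mask stay bit-identical.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize for an (H, W) or (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols]


def refine_region(image, mask, refine_fn, work_size=64, pad=2):
    """Crop around the mask, enlarge, refine, shrink, and blend back.

    image:     float array of shape (H, W) or (H, W, C)
    mask:      boolean array of shape (H, W); True marks the region to refine
    refine_fn: hypothetical refiner mapping an array to an array of equal shape
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image.copy()  # nothing selected, nothing to change

    # Padded bounding box of the selected region, clipped to the image.
    h, w = mask.shape
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + 1 + pad, h)
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + 1 + pad, w)

    # Crop-resize: the small region fills the whole working resolution.
    crop = image[y0:y1, x0:x1]
    enlarged = resize_nearest(crop, work_size, work_size)
    refined = refine_fn(enlarged)
    restored = resize_nearest(refined, y1 - y0, x1 - x0)

    # Mask blending: refined pixels inside the mask, originals elsewhere.
    alpha = mask[y0:y1, x0:x1].astype(image.dtype)
    if image.ndim == 3:
        alpha = alpha[..., None]
    out = image.copy()
    out[y0:y1, x0:x1] = alpha * restored + (1 - alpha) * crop
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))
    mask = np.zeros((32, 32), dtype=bool)
    mask[10:14, 12:18] = True

    # Dummy refiner that paints the crop white, standing in for a real model.
    result = refine_region(image, mask, lambda x: np.ones_like(x))
    print("background unchanged:", np.array_equal(result[~mask], image[~mask]))
```

Because the blend weight is exactly 0 outside the mask, the background is preserved by construction in this sketch; a soft-edged mask would trade some of that strictness for smoother seams at the region boundary.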

Abstract

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that…
