TL;DR
RefineAnything is a multimodal diffusion-based model designed for precise local image refinement, maintaining background integrity while enhancing fine details within user-specified regions.
Contribution
It introduces a novel region-focused refinement strategy with a boundary-aware loss, and constructs a new benchmark for evaluating local image editing quality.
Findings
Achieves near-perfect background preservation in local refinement tasks.
Outperforms baselines on the RefineEval benchmark.
Demonstrates the effectiveness of the crop-and-resize and mask-blending strategies.
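To make the crop-resize and mask-blending idea concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It crops the bounding box of the user mask, resizes the crop to a working resolution, applies a refinement step, resizes back, and pastes only the masked pixels into a copy of the input. The names `bbox_from_mask`, `resize_nearest`, `refine_region`, `refine_fn`, `work_size`, and `pad` are hypothetical. `refine_fn` stands in for the diffusion model, nearest-neighbour resizing is used only to avoid extra dependencies, and the hard (binary) paste is an assumption chosen because it leaves every pixel outside the mask bit-identical.

```python
import numpy as np


def bbox_from_mask(mask, pad=8):
    """Return (top, bottom, left, right) of the mask's bounding box, padded and clipped."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    top = max(int(ys.min()) - pad, 0)
    bottom = min(int(ys.max()) + 1 + pad, h)
    left = max(int(xs.min()) - pad, 0)
    right = min(int(xs.max()) + 1 + pad, w)
    return top, bottom, left, right


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (kept simple so the sketch needs only NumPy)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows][:, cols]


def refine_region(image, mask, refine_fn, work_size=64, pad=8):
    """Refine only the masked region of `image`; all other pixels are copied unchanged.

    `refine_fn` is a hypothetical placeholder for the refinement model: it takes
    and returns an array of shape (work_size, work_size, C).
    """
    mask = mask.astype(bool)
    if not mask.any():
        return image.copy()  # nothing to refine

    top, bottom, left, right = bbox_from_mask(mask, pad)
    crop = image[top:bottom, left:right]
    crop_h, crop_w = crop.shape[:2]

    # Crop-resize: give the small region the full working resolution.
    upscaled = resize_nearest(crop, work_size, work_size)
    refined = refine_fn(upscaled)
    restored = resize_nearest(refined, crop_h, crop_w)

    # Mask blending (hard paste, an assumption): write back masked pixels only.
    out = image.copy()
    local_mask = mask[top:bottom, left:right]
    region = out[top:bottom, left:right]  # view into `out`
    region[local_mask] = restored[local_mask]
    return out
```

In this sketch, background preservation follows from the final paste step rather than from the refinement model itself: whatever `refine_fn` returns, pixels outside the mask are never written. A soft (feathered) mask would trade that exactness for smoother seams at the region boundary.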
Abstract
We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that…
