Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Jingxuan He, Xiyu Wang, Yunke Wang, Mengyu Zheng, Chang Xu

TL;DR
This paper shows that pretrained instruction-based image editing models can be repurposed for zero-shot referring image segmentation by exploiting their implicit semantic grounding, outperforming existing zero-shot baselines without any additional training.
Contribution
It introduces a training-free framework that exploits internal representations of image editing models for pixel-level segmentation based on natural language descriptions.
Findings
Strong foreground-background separability emerges in the models' internal representations at the earliest denoising timestep, before any visible image transformation.
The proposed method outperforms existing zero-shot baselines on standard datasets.
Accurate segmentation is achieved with a single denoising step, without full image synthesis.
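To make the idea of foreground-background separability concrete, here is a minimal sketch of how a binary mask could be read off a per-pixel feature map. It is an illustration under stated assumptions, not the authors' method: the function name `separability_mask`, the `features` array (a stand-in for internal activations captured after one denoising step), the `text_embedding` vector, and the Otsu-style threshold are all hypothetical choices made for this example.

```python
import numpy as np


def separability_mask(features, text_embedding):
    """Turn per-pixel features into a binary foreground mask.

    features: (H, W, C) array, a hypothetical stand-in for internal
        activations taken from a single denoising step.
    text_embedding: (C,) array, a hypothetical pooled embedding of the
        referring expression.
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    # Normalise so the dot product below is a cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    scores = flat @ text  # one similarity score per pixel

    # Otsu-style search: pick the cut that maximises the between-class
    # variance of the two groups it creates.
    candidates = np.linspace(scores.min(), scores.max(), 64)[1:-1]
    best_t, best_var = candidates[0], -1.0
    for t in candidates:
        fg = scores > t
        frac = fg.mean()
        if frac == 0.0 or frac == 1.0:
            continue  # degenerate split, nothing to separate
        gap = scores[fg].mean() - scores[~fg].mean()
        var = frac * (1.0 - frac) * gap ** 2
        if var > best_var:
            best_var, best_t = var, t
    return (scores > best_t).reshape(h, w)
```

If the features really are well separated, the similarity scores form two clusters and any threshold in the gap between them recovers the referred region. On real model activations the choice of layer, similarity measure, and threshold would all need to be validated.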
Abstract
Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image…
