TL;DR
INSID3 is a training-free method that leverages frozen DINOv3 features for versatile in-context segmentation, achieving state-of-the-art results without supervision or auxiliary models.
Contribution
It demonstrates that a single self-supervised backbone can support both semantic matching and segmentation without additional training or supervision.
Findings
Outperforms previous methods by +7.5% mIoU in segmentation tasks.
Uses 3x fewer parameters than prior approaches.
Operates without any mask or category-level supervision.
Abstract
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example.…
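To make the idea of training-free matching on frozen features concrete, here is a minimal sketch of a generic prototype-matching baseline. It is not INSID3's actual pipeline, which the abstract does not detail. It assumes patch features have already been extracted by some frozen backbone; the toy 2-D vectors, the function names (`masked_prototype`, `segment_by_prototype`), and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def masked_prototype(ref_feats, ref_mask):
    """Average the normalized reference features that fall inside the mask."""
    selected = [l2_normalize(f) for f, m in zip(ref_feats, ref_mask) if m]
    if not selected:
        raise ValueError("reference mask selects no patches")
    dim = len(selected[0])
    mean = [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
    return l2_normalize(mean)

def segment_by_prototype(ref_feats, ref_mask, tgt_feats, threshold=0.5):
    """Label each target patch by cosine similarity to the reference prototype.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    proto = masked_prototype(ref_feats, ref_mask)
    sims = [sum(a * b for a, b in zip(l2_normalize(f), proto)) for f in tgt_feats]
    return [s > threshold for s in sims], sims

# Toy stand-ins for per-patch features of a reference and a target image.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_mask = [True, True, False]          # annotated in-context example
tgt_feats = [[0.8, 0.2], [0.1, 0.9], [1.0, 0.1]]

pred_mask, sims = segment_by_prototype(ref_feats, ref_mask, tgt_feats)
print(pred_mask)  # [True, False, True]
```

No weights are updated anywhere in this sketch: the only inputs are frozen features and one annotated example, which is the setting the abstract describes. A real system would operate on a 2-D grid of high-dimensional patch features and would need a more careful decision rule than a fixed threshold.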
