No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa; Chenhongyi Yang; Linus Ericsson; Steven McDonagh; Elliot J. Crowley

arXiv:2507.02798·cs.CV·February 4, 2026

TL;DR

This paper introduces a training-free, reference-based instance segmentation method that leverages foundation models' semantic priors to identify object regions across images, reducing annotation and prompt engineering efforts.

Contribution

It proposes a novel multi-stage, training-free approach utilizing memory banks, representation aggregation, and semantic-aware matching for segmentation based on reference images.
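The three ingredients named above (memory bank, representation aggregation, matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `MemoryBank`, the helper names, and the toy 2-D feature vectors are all hypothetical. In practice the features would come from a frozen encoder and the masks from a promptable segmenter; here they are hand-written lists so the example runs with the standard library alone.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


class MemoryBank:
    """Hypothetical store of per-class reference features."""

    def __init__(self):
        # class label -> list of instance-level features
        self._instances = defaultdict(list)

    def add_reference(self, label, patch_features, mask):
        """Average the patch features that fall inside the reference mask."""
        inside = [f for f, m in zip(patch_features, mask) if m]
        if inside:
            self._instances[label].append(mean_vector(inside))

    def prototypes(self):
        """Aggregate instance features into one unit-length prototype per class."""
        return {
            label: l2_normalize(mean_vector([l2_normalize(f) for f in feats]))
            for label, feats in self._instances.items()
        }


def classify_region(region_feature, prototypes):
    """Return the best-matching class and its cosine score for a candidate region."""
    scores = {label: cosine(region_feature, p) for label, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy usage: two reference images, one candidate region from a target image.
bank = MemoryBank()
bank.add_reference("cat", [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]], [1, 1, 0])
bank.add_reference("dog", [[0.0, 1.0], [0.1, 0.9]], [1, 1])
label, score = classify_region([0.9, 0.1], bank.prototypes())
```

The sketch averages twice (patches into an instance feature, then instances into a class prototype), which is one plausible reading of "representation aggregation"; the paper's actual aggregation and matching rules may differ.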

Findings

01. Achieves state-of-the-art results on COCO FSOD with 36.8% nAP.

02. Outperforms existing training-free methods on Cross-Domain FSOD with 22.4% nAP.

03. Significantly improves segmentation metrics across benchmarks.

Abstract

The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method…
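As a rough illustration of how reference-to-target correspondences could drive automatic prompting, the sketch below scores every patch of a target image against a single reference prototype and keeps the high-similarity patches as candidate prompt locations. It illustrates the general idea only, not the paper's pipeline: the function names, the 0.8 threshold, and the toy 2x2 feature grid are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def similarity_map(target_grid, prototype):
    """Cosine similarity of each target patch feature to the reference prototype."""
    return [[cosine(f, prototype) for f in row] for row in target_grid]


def candidate_points(sim_map, threshold=0.8):
    """Return (row, col, score) for patches at or above the threshold, best first."""
    hits = [
        (r, c, s)
        for r, row in enumerate(sim_map)
        for c, s in enumerate(row)
        if s >= threshold
    ]
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Toy 2x2 grid of 2-D patch features from a hypothetical target image.
prototype = [1.0, 0.0]
target_grid = [
    [[0.0, 1.0], [0.9, 0.1]],
    [[0.1, 0.9], [1.0, 0.0]],
]
points = candidate_points(similarity_map(target_grid, prototype))
```

In a full system the selected locations would be handed to a promptable segmenter as point prompts; that step is omitted here because it depends on the specific model used.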

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 4

Strengths

1. This method not only addresses the automation challenges of frameworks like SAM (such as their "lack of semantic awareness and need for manual intervention") but also avoids issues like overfitting and domain shift caused by the "requirement for fine-tuning on novel categories" in traditional methods. Its effectiveness has been verified through laboratory experiments. 2. The paper achieves better optimization for scenarios (e.g., camouflaged objects) that existing vision foundation models, such a

Weaknesses

1. A key limitation of traditional DINO/CLIP-based frameworks for training-free open-vocabulary semantic segmentation (OVSeg) lies in their requirement for a predefined category list during evaluation; this prevents them from being classified as genuine open-vocabulary methods. By comparison, generative vision-language model (VLM)-based methods possess intrinsic properties that make them more adept at realizing open-domain perception. Please analyze and compare the strengths of the aforementioned

Reviewer 02·Rating 4·Confidence 5

Strengths

1. Practical Significance: The training-free paradigm reduces deployment costs for low-annotation domains (e.g., underwater, microscopic imaging), aligning with real-world needs for fast adaptation. 2. Efficiency-Accuracy Balance: It outperforms prior methods (e.g., Matcher) on accuracy and runs ~129x faster (0.929s/img vs. 120.014s/img), striking a rare balance. 3. Clarity & Reproducibility: The three-stage framework is visualized clearly (Fig. 2), with thorough implementation details (resolutio

Weaknesses

1. Limited Originality: The method combines off-the-shelf models (SAM/DINOv2) with classic techniques (cosine similarity, soft merging)—no novel methodology or theoretical insights, making it an engineering implementation rather than an innovation. 2. Incomplete Experiments: No ablation for two-step aggregation (e.g., instance-only vs. class-only prototypes) or memory bank design; no comparison with mainstream “VLM+SAM” pipelines, weakening competitiveness arguments. 3. Shallow Analysis: Failure

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper is well-written and clearly structured. The motivation is timely and compelling: reducing annotation and training cost is highly relevant in the era of foundation models. 2. The method is elegant and efficient, with minimal overhead and no need for finetuning, making it broadly applicable to low-resource or rapid-deployment scenarios. 3. The method performs competitively with (or even better than) fine-tuned approaches on COCO-FSOD, PASCAL-FSOD, and CD-FSOD, with good cross-domain genera

Weaknesses

1. Limited discussion of generalist models such as DINO-X, SINE, and T-REX: These methods also seem to support reference-based instance segmentation. The current paper does not systematically analyze how its method compares in terms of design, generalization, or efficiency. 2. Evaluation scope: Results emphasize benchmarks; there is no demonstration on real-world deployment or interactive scenarios.

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Neural Network Applications

Methods: Sparse Evolutionary Training