No time to train! Training-Free Reference-Based Instance Segmentation
Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

TL;DR
This paper introduces a training-free, reference-based instance segmentation method that leverages foundation models' semantic priors to identify object regions across images, reducing annotation and prompt engineering efforts.
Contribution
It proposes a novel multi-stage, training-free approach utilizing memory banks, representation aggregation, and semantic-aware matching for segmentation based on reference images.
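The memory-bank, aggregation, and matching steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-patch feature vectors (for example from a DINOv2-style encoder) and candidate masks (for example from SAM) are already available, and it substitutes toy 2-D vectors for real features. The function names, the two-step averaging (instance prototypes, then a class prototype), and the `threshold` value are all hypothetical choices made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_memory_bank(references):
    """Aggregate reference features into one prototype per class.

    references: {class_name: [instance, ...]}, where each instance is the
    list of patch feature vectors that fall inside its reference mask.
    Assumed aggregation: average patches per instance, then average the
    instance prototypes per class.
    """
    bank = {}
    for cls, instances in references.items():
        instance_protos = [mean_vector(patches) for patches in instances]
        bank[cls] = mean_vector(instance_protos)
    return bank


def classify_candidate(candidate_patches, bank, threshold=0.5):
    """Label one candidate mask by its most similar class prototype.

    Returns (label, score); label is None when the best similarity is
    below the (hypothetical) threshold, i.e. the candidate is rejected.
    """
    feat = mean_vector(candidate_patches)
    scores = {cls: cosine(feat, proto) for cls, proto in bank.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy usage: two "cat" reference instances and one "dog" reference instance.
references = {
    "cat": [[[1.0, 0.0], [0.9, 0.1]], [[0.8, 0.2]]],
    "dog": [[[0.0, 1.0], [0.1, 0.9]]],
}
bank = build_memory_bank(references)
label, score = classify_candidate([[0.9, 0.1], [1.0, 0.0]], bank)
print(label, round(score, 3))
```

In this sketch the semantics come entirely from the reference prototypes, so adding a new category only means adding entries to the bank, with no weight updates. A fuller pipeline would also need to merge overlapping candidate masks, which the sketch omits.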
Findings
Achieves state-of-the-art results on COCO FSOD with 36.8% nAP.
Outperforms existing training-free methods on Cross-Domain FSOD with 22.4% nAP.
Significantly improves segmentation metrics across benchmarks.
Abstract
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method…
Peer Reviews
Decision: Submitted to ICLR 2026
1. This method not only addresses the automation challenges of frameworks like SAM, such as their "lack of semantic awareness and need for manual intervention", but also avoids issues like overfitting and domain shift caused by the "requirement for fine-tuning on novel categories" in traditional methods. Its effectiveness has been verified through laboratory experiments. 2. The paper achieves better optimization for scenarios (e.g., camouflaged objects) that existing vision foundation models, such a
1. A key limitation of traditional DINO/CLIP-based frameworks for training-free open-vocabulary semantic segmentation (OVSeg) lies in their requirement for a predefined category list during evaluation, which prevents them from being classified as genuine open-vocabulary methods. By comparison, generative vision-language model (VLM)-based methods possess intrinsic properties that make them more adept at realizing open-domain perception. Please analyze and compare the strengths of the aforementioned
1. Practical Significance: The training-free paradigm reduces deployment costs for low-annotation domains (e.g., underwater, microscopic imaging), aligning with real-world needs for fast adaptation. 2. Efficiency-Accuracy Balance: It outperforms prior methods (e.g., Matcher) on accuracy and runs ~129x faster (0.929s/img vs. 120.014s/img), striking a rare balance. 3. Clarity & Reproducibility: The three-stage framework is visualized clearly (Fig.2), with detailed implementation details (resolutio
1. Limited Originality: The method combines off-the-shelf models (SAM/DINOv2) with classic techniques (cosine similarity, soft merging)—no novel methodology or theoretical insights, making it an engineering implementation rather than an innovation. 2. Incomplete Experiments: No ablation for two-step aggregation (e.g., instance-only vs. class-only prototypes) or memory bank design; no comparison with mainstream “VLM+SAM” pipelines, weakening competitiveness arguments. 3. Shallow Analysis: Failure
1. The paper is well-written and clearly structured. The motivation is timely and compelling: reducing annotation and training cost is highly relevant in the era of foundation models. 2. The method is elegant and efficient, with minimal overhead and no need for finetuning, making it broadly applicable to low-resource or rapid-deployment scenarios. 3. The method performs competitively with (or even better than) fine-tuned approaches on COCO-FSOD, PASCAL-FSOD, and CD-FSOD, with good cross-domain genera
1. Limited discussion of generalist models such as DINO-X, SINE, and T-REX: These methods also seem to support reference-based instance segmentation. The current paper does not systematically analyze how its method compares in terms of design, generalization, or efficiency. 2. Evaluation scope: results emphasize benchmarks; no demonstration on real-world deployment or interactive scenarios.
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
