NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Max Gandyra, Alessandro Santonicola, Michael Beetz

TL;DR
NOCTIS is a training-free instance segmentation framework that combines pre-trained models with a cyclic thresholding mechanism to accurately segment novel objects in RGB images without additional training.
Contribution
It introduces a novel cyclic thresholding method and an RGB-only pipeline that outperform existing RGB and RGB-D methods on unseen object segmentation tasks.
Findings
Outperforms state-of-the-art methods on BOP 2023 datasets
Does not require further training or fine-tuning
Works effectively with only RGB data
Abstract
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks, and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism…
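The abstract is cut off before the CT mechanism is defined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a proposal patch counts toward the appearance score only if a round trip (proposal patch → best template patch → best proposal patch) lands within `max_dist` grid cells of where it started. The function names, `grid_w`, `max_dist`, and the way `w_appe` weights the two terms are all hypothetical; plain NumPy arrays stand in for DINOv2 class and patch embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def cyclic_mask(sim, grid_w, max_dist):
    """Assumed round-trip check on a (P, T) patch similarity matrix.

    Proposal patches are assumed to lie row-major on a grid with grid_w columns.
    Returns a boolean array of shape (P,): True where the round trip lands
    within max_dist grid cells of the starting patch.
    """
    fwd = sim.argmax(axis=1)   # best template patch for each proposal patch
    bwd = sim.argmax(axis=0)   # best proposal patch for each template patch
    back = bwd[fwd]            # proposal index where each round trip lands
    start = np.arange(sim.shape[0])
    r0, c0 = np.divmod(start, grid_w)
    r1, c1 = np.divmod(back, grid_w)
    return np.hypot(r0 - r1, c0 - c1) <= max_dist


def matching_score(cls_p, cls_t, patch_p, patch_t, grid_w, max_dist=1.0, w_appe=0.5):
    """Hypothetical combination of class similarity and a CT-filtered patch score."""
    cls_sim = float(l2_normalize(cls_p) @ l2_normalize(cls_t))
    sim = l2_normalize(patch_p) @ l2_normalize(patch_t).T
    keep = cyclic_mask(sim, grid_w, max_dist)
    # Average of per-patch maximum similarity; rejected patches contribute zero.
    appe = float(np.where(keep, sim.max(axis=1), 0.0).mean())
    # Assumed weighting: w_appe scales the appearance term (not from the source).
    return (cls_sim + w_appe * appe) / (1.0 + w_appe)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cls_p, cls_t = rng.normal(size=8), rng.normal(size=8)
    patch_p, patch_t = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
    print(matching_score(cls_p, cls_t, patch_p, patch_t, grid_w=4))
```

Under this reading, a larger `max_dist` relaxes the strict mutual-nearest-neighbour requirement, which is consistent with the reviewers' description of CT as a remedy for over-strict matching on repetitive textures.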
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1) Proposes a training-free method for novel class instance segmentation. 2) Proposes the cyclic thresholding method to mitigate the multi-to-one matching problem caused by strict matching. 3) Achieves SOTA or SOTA-comparable performance.
1) What is the relationship between the task defined in this paper and open-set/open vocabulary instance segmentation? The author claims it is infeasible to train an instance segmentor that can cover sufficiently many instances, yet this is precisely what open-set/open vocabulary instance segmentation tasks aim to achieve. These tasks also design their models based on the generalization capabilities of powerful pre-trained models like SAM and DINO. 2) The related work section also lacks a compa
- Performance: This paper successfully integrates multiple methodological approaches, thereby translating the generalization capability of foundation models into State-of-the-Art performance in the BOP 2023 challenge.
- Reproducibility: The paper provides a large amount of detailed description about the experimental setup (e.g., software versions used, random seed, hardware configuration, and running time), which is very helpful for ensuring the good reproducibility of the results.
- One of the
1. Lack of Novelty: It seems that the proposed method primarily relies on the integration of minor innovations (e.g., confidence and score aggregation) on top of existing foundation models. It is largely built upon prior works such as CNOS[1] and SAM-6D[2], resulting in limited incremental novelty. 2. Although the Cyclic Thresholding (CT) mechanism is highlighted as a major innovation, its actual performance gain is extremely limited (0.512 to 0.516 in ablation), which is disproportionate to the
1. SOTA on RGB-Only: Achieved SOTA on the BOP benchmark using only RGB images, outperforming methods that rely on RGB-D (depth) data. 2. Novel CT Matching Algorithm: This paper introduced the "Cyclic Thresholding" (CT) mechanism, a new and effective algorithm that addresses DINOv2's matching instability on repetitive textures.
First, the paper's core premise of "novelty" is questionable. The framework relies heavily on foundation models (GSAM 2 and DINOv2) that were pre-trained on massive datasets. It is highly probable that these models have already "seen" the object categories present in the BOP benchmark. Therefore, the "zero-shot" capability claimed is more a feat of the models' generalization than true segmentation of unseen objects. Second, the innovation is incremental and best described as a clever engineerin
1. The proposed approach is straightforward and not difficult to understand. 2. Experiments on NOCTIS yield impressive results.
1. The overall contribution, which I believe centers around Eq.4, appears limited for an ICLR submission. 2. The claimed contribution on "removing selection bias" is not supported by experiments. 3. The proposed approach introduces additional parameters such as CT and $w_{appe}$, which adds to the difficulties in parameter tuning for real-world usage. 4. (Minor) It appears to me that the paper's choice of language style is more like a speech than a research paper.
Taxonomy
Topics: Advanced Neural Network Applications · Face recognition and analysis · Advanced Image and Video Retrieval Techniques
