FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion

Chen-Bin Feng; Youyang Sha; Longfei Liu; Yongjun Yu; Chi Man Vong; Xuanlong Yu; Xi Shen

arXiv:2602.03137 · cs.CV · February 4, 2026

TL;DR

This paper introduces FSOD-VFM, a novel framework leveraging vision foundation models and graph diffusion to improve few-shot object detection, significantly reducing false positives and enhancing detection accuracy without additional training.

Contribution

The paper proposes a graph-based confidence reweighting method that refines object proposals from foundation models, improving detection quality in few-shot scenarios.
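To make the idea of graph-based confidence reweighting concrete, here is a minimal sketch. This page does not reproduce the paper's equations, so everything below is an assumption made for illustration: the containment-based edge rule, the thresholds `tau` and `alpha`, and the personalized-PageRank-style update are stand-ins, not the authors' formulation. Each proposal is a node; a small box that lies mostly inside a larger box gets a directed edge to it and passes its energy along that edge, so a complete-object box accumulates energy from its fragments.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def containment(inner: Box, outer: Box) -> float:
    """Fraction of `inner` that lies inside `outer` (0..1)."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    if w <= 0 or h <= 0 or area(inner) == 0:
        return 0.0
    return (w * h) / area(inner)


def diffuse_scores(boxes: List[Box], scores: List[float],
                   alpha: float = 0.5, tau: float = 0.8,
                   steps: int = 20) -> List[float]:
    """Hypothetical reweighting: fragments push energy to the box containing them."""
    n = len(boxes)
    # Directed edge i -> j when box i is smaller than j and mostly inside it.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and area(boxes[i]) < area(boxes[j]):
                c = containment(boxes[i], boxes[j])
                if c >= tau:
                    weights[i][j] = c
    # Row-normalise outgoing weights; a node with no outgoing edge keeps
    # its own energy through a self-loop.
    for i in range(n):
        total = sum(weights[i])
        if total == 0:
            weights[i][i] = 1.0
        else:
            weights[i] = [w / total for w in weights[i]]
    # Iterate: mix the original confidence with the energy received from neighbours.
    energy = list(scores)
    for _ in range(steps):
        energy = [
            (1 - alpha) * scores[j]
            + alpha * sum(energy[i] * weights[i][j] for i in range(n))
            for j in range(n)
        ]
    return energy


# Toy example: one whole-object box, two fragments inside it, one isolated box.
boxes = [(0, 0, 100, 100), (0, 0, 40, 40), (60, 60, 100, 100), (200, 200, 220, 220)]
scores = [0.5, 0.6, 0.55, 0.3]
reweighted = diffuse_scores(boxes, scores)
ranking = sorted(range(len(boxes)), key=lambda k: reweighted[k], reverse=True)
print(ranking[0])  # index of the top-ranked proposal after diffusion
```

In this toy setup the diffused energies are not confined to [0, 1]; they are only meant for ranking proposals, and a real pipeline would need some normalisation before thresholding.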

Findings

1. Achieves 31.6 AP on the CD-FSOD dataset in the 10-shot setting.
2. Outperforms existing training-free methods significantly.
3. Effectively reduces false-positive proposals through graph diffusion.

Abstract

In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph…
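The abstract names mask extraction and DINOv2 features as the route to adapting to new categories, but this page does not say how proposals are matched to classes. One common training-free option, assumed here purely for illustration, is prototype matching: average the patch features under each mask, average the few support embeddings per class into a prototype, and label a query proposal by cosine similarity. The function names, the toy two-dimensional "features", and the use of cosine similarity are all hypothetical, not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def masked_average(patch_features: Sequence[Vector], mask: Sequence[bool]) -> Vector:
    """Average only the patch features that the mask marks as foreground."""
    selected = [f for f, keep in zip(patch_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """One prototype per class: the mean of its few-shot support embeddings."""
    prototypes = {}
    for name, shots in support.items():
        dim = len(shots[0])
        prototypes[name] = [sum(s[d] for s in shots) / len(shots) for d in range(dim)]
    return prototypes


def classify(query: Vector, prototypes: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the best-matching class name and its cosine similarity."""
    return max(
        ((name, cosine(query, proto)) for name, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )


# Toy 2-D "features" standing in for real backbone embeddings.
support = {"cat": [[1.0, 0.0], [0.9, 0.1]], "dog": [[0.0, 1.0], [0.1, 0.9]]}
prototypes = build_prototypes(support)

patches = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
mask = [True, False, True]  # the middle patch is treated as background
query = masked_average(patches, mask)
label, similarity = classify(query, prototypes)
print(label, round(similarity, 3))
```

Pooling under the mask rather than over the whole box keeps background patches out of the embedding, which is the kind of benefit a clean foreground mask would be expected to bring.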

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

1. The idea of utilizing vision foundation models is novel. 2. The formulas are very helpful for understanding the core idea of graph diffusion and the motivation of the research. 3. The proposed method also generalizes well to cross-domain scenarios, which is remarkable. 4. The proposed method achieves SOTA performance. 5. The limitation section provides a good analysis of the proposed method, especially the part about the marginal improvement as the number of annotated samples increases.

Weaknesses

1. In the FSOD and CD-FSOD settings, the basic assumption is that novel-category objects are either never seen in the training set or not labeled. However, with vision foundation models, how do the authors ensure that the novel objects are truly unseen? This point should be stated more clearly. 2. The proposed network takes a very different approach from classic FSOD paradigms; what is the computational overhead in the proposed setting?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed graph diffusion method effectively suppresses the overfragmentation without training. 2. The proposed method is simple and easy to follow. 3. The writing of the paper is clear and fluent.

Weaknesses

There are issues with the method: a) The graph diffusion module uses Equation 3 to calculate the diffusion energy between two proposed bounding boxes. However, Equation 3 relies heavily on the confidence scores provided by UPN. If fragmented bounding boxes have higher confidence scores, the graph diffusion process cannot improve the confidence of high-quality candidate boxes. b) The method in the paper uses SAM to segment the bounding box proposals given by UPN, in order to obtain cleaner foregroun

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper introduces a graph diffusion confidence reweighting mechanism to address the prevalent issue of proposal fragmentation in training-free few-shot detection. It constructs a directed graph among proposals and simulates energy propagation to distinguish between complete object proposals and partial fragment proposals, thereby significantly enhancing detection granularity. 2. It is commendable that the authors have validated the method's effectiveness across multiple benc

Weaknesses

1. The model is overly complex, employing three powerful foundation models to address a relatively specific issue (fragmented bounding boxes generated by UPN). The model complexity and computational cost are exceptionally high. SAM2 is a powerful segmentation model whose generated precise masks already contain rich structural information of objects. Why can't detection boxes be generated directly from SAM2's masks, or why can't a more streamlined pipeline be built around SAM2 as the core, rathe

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning