FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
Chen-Bin Feng, Youyang Sha, Longfei Liu, Yongjun Yu, Chi Man Vong, Xuanlong Yu, Xi Shen

TL;DR
This paper introduces FSOD-VFM, a novel framework leveraging vision foundation models and graph diffusion to improve few-shot object detection, significantly reducing false positives and enhancing detection accuracy without additional training.
Contribution
The paper proposes a graph-based confidence reweighting method that refines object proposals from foundation models, improving detection quality in few-shot scenarios.
Findings
Achieves 31.6 AP on the CD-FSOD dataset in the 10-shot setting.
Outperforms existing training-free methods significantly.
Effectively reduces false-positive proposals through graph diffusion.
Abstract
In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph…
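The abstract is cut off before it defines the graph, so the following is only a minimal sketch of the general idea of confidence reweighting by diffusion over a directed proposal graph. It is not the paper's Equation 3. The edge rule (a box points to a larger box that covers most of it), the containment threshold `tau`, the mixing weight `alpha`, and the function name `reweight_confidences` are all assumptions made for illustration.

```python
import numpy as np

def reweight_confidences(boxes, scores, tau=0.8, alpha=0.5, iters=50):
    """Hypothetical diffusion-style reweighting (illustrative, not the paper's method).

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) initial confidences, e.g. from a proposal network
    """
    b = np.asarray(boxes, dtype=float)
    s0 = np.asarray(scores, dtype=float)
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

    # Pairwise intersection areas via broadcasting.
    x1 = np.maximum(b[:, None, 0], b[None, :, 0])
    y1 = np.maximum(b[:, None, 1], b[None, :, 1])
    x2 = np.minimum(b[:, None, 2], b[None, :, 2])
    y2 = np.minimum(b[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    # contain[i, j]: fraction of box i that lies inside box j.
    contain = inter / area[:, None]

    # Assumed edge rule: i -> j when j is larger and covers most of i,
    # so fragments point at the box that encloses them.
    edges = (contain >= tau) & (area[None, :] > area[:, None])
    W = np.where(edges, contain, 0.0)

    # Boxes with no outgoing edge keep their own energy (self-loop).
    idx = np.flatnonzero(W.sum(axis=1) == 0)
    W[idx, idx] = 1.0
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

    # Damped propagation: energy flows along edges, anchored to s0.
    s = s0.copy()
    for _ in range(iters):
        s = alpha * (P.T @ s) + (1.0 - alpha) * s0
    return s

# Toy example: one full-object box, two fragments inside it, one isolated box.
boxes = [[0, 0, 100, 100], [0, 0, 40, 40], [50, 50, 90, 90], [200, 200, 260, 260]]
scores = [0.5, 0.6, 0.55, 0.7]
new_scores = reweight_confidences(boxes, scores)
```

In this toy setup the enclosing box gains confidence at the expense of the two fragments, and the isolated box is unchanged. The outputs are reweighted energies rather than probabilities, so they can exceed 1 and would need rescaling before thresholding.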
Peer Reviews
Decision·ICLR 2026 Poster
1. The idea of utilizing vision foundation models is novel. 2. The formulas are very helpful for understanding the core idea of graph diffusion and the motivation of the research. 3. The proposed method also generalizes well to cross-domain scenarios, which is remarkable. 4. The proposed method achieves SOTA performance. 5. The limitation section provides a good analysis of the proposed method, especially the part about the marginal improvement as the number of annotated samples increases.
1. In the FSOD and CD-FSOD settings, the basic assumption is that novel-category objects are either never seen in the training set or not labeled. However, with a vision foundation model, how do the authors ensure that the novel objects are truly unseen? This point should be stated more clearly. 2. The proposed network takes a very different approach from classic FSOD paradigms. What is the computational overhead in the proposed setting?
1. The proposed graph diffusion method effectively suppresses the overfragmentation without training. 2. The proposed method is simple and easy to follow. 3. The writing of the paper is clear and fluent.
There are issues with the method: a) The graph diffusion module uses Equation 3 to calculate the diffusion energy between two proposed bounding boxes. However, Equation 3 relies heavily on the confidence scores provided by UPN. If fragmented bounding boxes have higher confidence scores, the graph diffusion process cannot improve the confidence of high-quality candidate boxes. b) The method in the paper uses SAM to segment the bounding box proposals given by UPN, in order to obtain cleaner foregroun
1. The paper introduces a graph diffusion confidence reweighting mechanism to address the prevalent issue of proposal fragmentation in training-free few-shot detection. It constructs a directed graph among proposals and simulates energy propagation to distinguish complete object proposals from partial fragment proposals, thereby significantly enhancing detection granularity. 2. It is commendable that the authors have validated the method's effectiveness across multiple benc
1. The model is overly complex, employing three powerful foundation models to address a relatively specific issue (fragmented bounding boxes generated by UPN). The model complexity and computational cost are exceptionally high. SAM2 is a powerful segmentation model whose generated precise masks already contain rich structural information of objects. Why can't detection boxes be generated directly from SAM2's masks, or why can't a more streamlined pipeline be built around SAM2 as the core, rathe
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
