VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
Chih-Chung Liu, Zhiwei Lin, Yongtao Wang

TL;DR
VL-SAM-v3 introduces a retrieval-based visual memory that supplies detailed visual priors for open-world object detection, significantly improving performance, especially on rare categories.
Contribution
The paper presents a unified framework that integrates external visual memory with detection prompts, enabling better open-vocabulary and open-ended detection.
Findings
Improves detection performance on the LVIS dataset, especially for rare categories.
Enhances open-vocabulary detection with retrieval-grounded visual priors.
Validates the approach with a stronger detector, SAM3.
Abstract
Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two settings: open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors…
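The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the memory-bank layout, the function name `retrieve_prototypes`, the use of cosine similarity, and the toy two-dimensional embeddings are all assumptions chosen for illustration. The sketch only shows the general idea of looking up the stored prototypes nearest to a query embedding for a candidate category.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prototypes(
    memory_bank: List[Tuple[str, Vector]],
    query: Vector,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (label, similarity) pairs closest to the query.

    Hypothetical helper: memory_bank is assumed to be a list of
    (label, prototype embedding) pairs, and the query is assumed to be
    an embedding of a candidate category.
    """
    scored = [(label, cosine(query, proto)) for label, proto in memory_bank]
    # Highest similarity first; the sort is stable, so ties keep bank order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy memory bank with made-up 2-D embeddings (illustrative values only).
memory = [
    ("zebra", [1.0, 0.0]),
    ("horse", [0.8, 0.6]),
    ("car", [0.0, 1.0]),
]
top = retrieve_prototypes(memory, query=[1.0, 0.0], top_k=2)
print(top)
```

In a real system the bank would hold image-derived features and the lookup would likely use an approximate nearest-neighbor index. How the retrieved prototypes are converted into the priors mentioned in the abstract is not covered by this sketch.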
