VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

Chih-Chung Liu; Zhiwei Lin; Yongtao Wang

arXiv:2605.03456 · cs.CV · May 12, 2026

TL;DR

VL-SAM-v3 adds a retrieval-based visual memory to open-world object detection. The memory supplies detailed visual priors that improve detection performance, especially on rare categories.

Contribution

The paper presents a unified framework that integrates external visual memory with detection prompts to improve both open-vocabulary and open-ended detection.

Findings

1. Improves detection performance on the LVIS dataset, especially for rare categories.

2. Enhances open-vocabulary detection with retrieval-grounded visual priors.

3. Validates the approach with a stronger detector, SAM3.

Abstract

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories: open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors…
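The retrieval step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the memory bank is indexed or how prototypes are scored, so the per-category dictionary, the cosine-similarity ranking, the query embedding, and the names `cosine` and `retrieve_prototypes` are all assumptions made for illustration. The sketch stops at retrieval and does not model how prototypes become visual priors.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prototypes(
    memory_bank: Dict[str, List[Vector]],
    category: str,
    query: Vector,
    k: int = 2,
) -> List[Tuple[float, Vector]]:
    """Return the k stored prototypes of `category` most similar to `query`.

    Assumption: the bank maps a category name to a list of prototype
    embeddings. The paper's actual memory layout may differ.
    """
    candidates = memory_bank.get(category, [])
    scored = [(cosine(query, proto), proto) for proto in candidates]
    # Sort by similarity only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy memory bank with made-up 2-D embeddings (hypothetical data).
memory_bank = {
    "zebra": [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
    "kettle": [[0.0, 1.0]],
}

# A candidate category is available, so its prototypes can be looked up.
top = retrieve_prototypes(memory_bank, "zebra", [1.0, 0.0], k=2)
for score, proto in top:
    print(f"{score:.2f} {proto}")
```

Because the bank is a plain lookup structure rather than learned weights, new prototypes can be added without retraining, which is the usual appeal of a non-parametric memory. A category with no stored prototypes simply returns an empty list.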
