VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

Zhiwei Lin; Yongtao Wang

arXiv:2505.18986·cs.CV·May 27, 2025

Open Access

TL;DR

VL-SAM-V2 is an open-world object detection framework that fuses open-set and open-ended queries, so it can discover unseen objects while improving detection performance, especially on rare categories.

Contribution

The paper introduces a query fusion module and ranked learnable queries for open-world detection, improving the discovery of unseen objects without category-level input from humans at inference time.

Findings

01. Outperforms previous open-set and open-ended methods on LVIS.

02. Excels particularly on rare object categories.

03. Demonstrates flexible evaluation in open-set and open-ended modes.

Abstract

Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers have introduced open-set perception models that detect or segment arbitrary categories supplied by the user at test time. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task, open-ended perception, which aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable…
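The difference between the two inference modes described in the abstract can be illustrated with a toy sketch. It shows only the input contract (open-set inference receives user-supplied category names, open-ended inference receives none) and is not the VL-SAM-V2 method. The names `Detection`, `CANDIDATES`, and `detect` are hypothetical, and the candidate list stands in for whatever a real model would propose for an image.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Detection:
    label: str
    score: float


# Hypothetical stand-in for the scored proposals a real detector would produce.
CANDIDATES = [
    Detection("dog", 0.9),
    Detection("frisbee", 0.7),
    Detection("tree", 0.4),
]


def detect(
    candidates: Sequence[Detection],
    categories: Optional[Sequence[str]] = None,
    threshold: float = 0.5,
) -> List[Detection]:
    """Toy illustration of the two input contracts.

    categories given -> open-set mode: only user-named categories are reported.
    categories None  -> open-ended mode: no category input from the user.

    Assumption: a real open-set detector conditions on the category text
    rather than filtering labels after the fact; filtering is used here
    only to keep the sketch self-contained.
    """
    kept = [d for d in candidates if d.score >= threshold]
    if categories is not None:
        allowed = set(categories)
        kept = [d for d in kept if d.label in allowed]
    return kept


open_set = detect(CANDIDATES, categories=["dog"])  # user names the categories
open_ended = detect(CANDIDATES)                    # no category input at all
```

In this sketch the open-set call reports only the category the user asked for, while the open-ended call reports every confident proposal without being told what to look for.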

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications