Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
Zhiwei Lin, Yongtao Wang, Zhi Tang

TL;DR
VL-SAM is a training-free framework that combines vision-language and segmentation models to detect and segment unseen objects in open-world scenarios without requiring object category inputs.
Contribution
This paper introduces VL-SAM, a novel training-free approach that leverages attention maps from pre-trained models for open-ended object detection and segmentation.
Findings
Outperforms previous open-ended detection methods on the LVIS dataset
Provides additional instance segmentation masks without training
Demonstrates strong generalization across different models and datasets
Abstract
Existing perception models achieve great success by learning from large amounts of labeled data, but they still struggle with open-world scenarios. To alleviate this issue, researchers introduce open-set perception tasks to detect or segment objects unseen in the training set. However, these models require predefined object categories as inputs during inference, which are not available in real-world scenarios. Recently, researchers have posed a new and more practical problem, i.e., open-ended object detection, which discovers unseen objects without any object categories as inputs. In this paper, we present VL-SAM, a training-free framework that combines the generalized object recognition model (i.e., Vision-Language Model) with the generalized object localization model (i.e., Segment-Anything Model), to address the open-ended object detection and segmentation task.…
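The title's idea of using attention as prompts can be illustrated with a small sketch. The code below is not the authors' method (the abstract above is truncated before it describes the procedure); it only assumes that a 2D attention map for one object is already available and shows one plausible way to turn it into point prompts for a promptable segmentation model. The function name `attention_to_point_prompts` and the parameters `num_points` and `min_distance` are hypothetical, chosen for illustration.

```python
import numpy as np


def attention_to_point_prompts(attn, num_points=3, min_distance=2):
    """Pick up to `num_points` high-attention locations as (x, y) prompts.

    Illustrative assumption: peaks of the attention map are good positive
    point prompts. This is a greedy peak picker, not the paper's procedure.
    """
    attn = np.asarray(attn, dtype=np.float64)

    # Min-max normalize so scores lie in [0, 1]; a flat map yields no prompts.
    span = attn.max() - attn.min()
    work = (attn - attn.min()) / span if span > 0 else np.zeros_like(attn)

    points = []
    for _ in range(num_points):
        y, x = np.unravel_index(np.argmax(work), work.shape)
        if work[y, x] <= 0:
            break  # nothing left above the background level
        points.append((int(x), int(y)))

        # Suppress a square neighborhood so the next pick is a distinct peak.
        y0, y1 = max(0, y - min_distance), min(work.shape[0], y + min_distance + 1)
        x0, x1 = max(0, x - min_distance), min(work.shape[1], x + min_distance + 1)
        work[y0:y1, x0:x1] = 0.0

    return points


# Toy attention map with two separated peaks and one near-duplicate.
toy = np.zeros((8, 8))
toy[2, 3] = 1.0
toy[2, 4] = 0.9  # falls inside the first peak's suppression window
toy[6, 6] = 0.8
prompts = attention_to_point_prompts(toy, num_points=2)
```

The points are returned in (x, y) order because that is the convention point-prompt interfaces such as the `point_coords` argument of `SamPredictor.predict` in the segment-anything package generally use; in that setting each point would be paired with a label of 1 to mark it as foreground.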
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image and Object Detection Techniques
Methods: Softmax · Attention Is All You Need · Segment Anything Model
