Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation
Jinwoo Ahn, Hyeokjoon Kwon, Hwiyeon Yoo

TL;DR
FOCUS is a novel foundation model-based method that enables fine-grained, open-vocabulary object detection with user-guided segmentation, improving detection of small components and incorporating user intent.
Contribution
The paper introduces FOCUS, a new approach that combines foundation models with user-guided natural language input for flexible, explainable, and fine-grained object detection.
Findings
Enhances detection of small object components.
Maintains high performance across diverse object types.
Reduces the need for extensive user intervention.
Abstract
The recent advent of vision-based foundation models has enabled efficient, high-quality object detection with ease. Despite the success of previous studies, object detection models remain limited in capturing small components of holistic objects and in taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and to let users directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention while still granting users significant control. With FOCUS, users can make explainable requests to actively…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: FOCUS
