Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Jinwoo Ahn; Hyeokjoon Kwon; Hwiyeon Yoo

arXiv:2411.15620·cs.CV·November 26, 2024

TL;DR

FOCUS is a foundation-model-based method for fine-grained, open-vocabulary object detection with user-guided segmentation. It improves detection of small components and takes user intent into account.

Contribution

The paper introduces FOCUS, an approach that combines vision foundation models with user-provided natural-language input for flexible, explainable, fine-grained object detection.

Findings

01. Enhances detection of small object components.

02. Maintains high performance across diverse object types.

03. Reduces the need for extensive user intervention.

Abstract

The recent advent of vision-based foundation models has enabled efficient, high-quality object detection with ease. Despite the success of previous studies, object detection models remain limited in capturing small components of holistic objects and in taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and to let users guide the detection process directly via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention while still granting users significant control. With FOCUS, users can make explainable requests to actively…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus