TL;DR
GUIDED introduces a modular framework for fine-grained open-vocabulary object detection, disentangling subject localization from attribute recognition to improve accuracy and robustness.
Contribution
GUIDED decomposes detection into separate localization and recognition pathways, supported by attribute fusion and discrimination modules, and achieves state-of-the-art results.
Findings
Achieves a new state of the art on the FG-OVD and 3F-OVD benchmarks.
Effectively disentangles subject localization from attribute recognition.
Improves detection accuracy by mitigating attribute over-representation.
Abstract
Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings, which leads to over-representation of attributes, mislocalization, and semantic drift in the embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to its role. Specifically, given a fine-grained class name, we first use a language model to extract a…
