HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Yuqi Ma; Mengyin Liu; Chao Zhu; Xu-Cheng Yin

arXiv:2409.16136·cs.CV·October 22, 2024

TL;DR

This paper introduces HA-FGOVD, a method that improves the attribute-level detection of frozen open-vocabulary object detection (OVD) models by explicitly highlighting fine-grained attributes, such as colors or materials, through linear composition of text features.

Contribution

The paper proposes a universal approach that explicitly highlights fine-grained attributes in frozen OVD models using linear composition, improving attribute-level detection performance.

Findings

01 Achieves state-of-the-art results on the FG-OVD dataset.

02 Attribute scalars transfer universally across models.

03 Significantly improves fine-grained attribute detection.

Abstract

Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), owing to their extensive training data and large number of parameters. Mainstream OVD models prioritize coarse-grained object categories over fine-grained attributes, e.g., colors or materials, and thus fail to identify objects specified by certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, and their latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens without highlighting them. Therefore, we propose a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in explicit linear space. First, an LLM is leveraged to highlight attribute…
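
The idea of highlighting attributes through linear composition can be illustrated with a small sketch. The abstract is truncated here, so everything below is an assumption made for illustration rather than the paper's implementation: the function names `l2_normalize`, `compose`, and `cosine`, the scalar names `w_global` and `w_attr`, and the final re-normalization step are all hypothetical. The sketch only shows how weighting an attribute-specific text feature against a global text feature shifts the composed feature toward the attribute direction.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def compose(global_feat, attr_feat, w_global=1.0, w_attr=1.0):
    """Linearly combine a global text feature with an attribute feature.

    w_global and w_attr are assumed scalar weights; a larger w_attr
    emphasizes the attribute direction in the composed feature.
    """
    if len(global_feat) != len(attr_feat):
        raise ValueError("features must have the same dimension")
    mixed = [w_global * g + w_attr * a for g, a in zip(global_feat, attr_feat)]
    return l2_normalize(mixed)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


# Toy 2-D features: axis 0 stands for the category, axis 1 for the attribute.
global_feat = [1.0, 0.0]
attr_feat = [0.0, 1.0]

weak = compose(global_feat, attr_feat, w_global=1.0, w_attr=0.5)
strong = compose(global_feat, attr_feat, w_global=1.0, w_attr=2.0)

# The more strongly weighted composition lies closer to the attribute axis.
print(cosine(weak, attr_feat) < cosine(strong, attr_feat))
```

In this toy setting the text encoder is left out entirely, which mirrors the frozen-model setting only loosely: the composition acts on features that are already computed, and the scalars are the only quantities that change.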

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Focus