InstructDET: Diversifying Referring Object Detection with Generalized Instructions

Ronghao Dang; Jiangyan Feng; Haodong Zhang; Chongjian Ge; Lin Song; Lijun Gong; Chengju Liu; Qijun Chen; Feng Zhu; Rui Zhao; Yibing Song

arXiv:2310.05136 · cs.AI · March 12, 2024 · 2 cites

TL;DR

InstructDET introduces a data-centric approach for referring object detection by generating diversified instructions using foundation models, significantly enhancing detection performance and generalization across datasets.

Contribution

The paper presents InstructDET, a novel method that leverages foundation models to generate diverse instructions for training, expanding existing datasets and improving referring object detection.

Findings

01. Outperforms existing methods on standard REC datasets.

02. Demonstrates effective instruction generation using foundation models.

03. Enables generalization to new detection instructions.

Abstract

We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. Although derived from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For each image, we produce a large number of instructions that refer to every single object and to different combinations of multiple objects. Each instruction and its corresponding object bounding boxes (bbxs) constitute one training data pair. To cover common detection expressions, we use an emerging vision-language model (VLM) and a large language model (LLM) to generate instructions guided by text prompts and object bbxs, since the generalization ability of foundation models is effective at producing human-like expressions (e.g., describing object property, category, and relationship). We…
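The pairing described in the abstract (one instruction plus the bounding boxes it refers to, built for single objects and for combinations of objects) can be illustrated with a minimal sketch. This is not the authors' pipeline: `TrainingPair`, `enumerate_targets`, `build_pairs`, the `max_group_size` cap, the box layout, and the `describe` callable (a stand-in for the VLM/LLM generation step) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max). The abstract does not specify one.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TrainingPair:
    """One instruction together with the boxes of the object(s) it refers to."""
    instruction: str
    bboxes: Tuple[BBox, ...]


def enumerate_targets(bboxes: List[BBox], max_group_size: int = 2) -> List[Tuple[BBox, ...]]:
    """List every single object, then every combination up to max_group_size objects."""
    targets: List[Tuple[BBox, ...]] = []
    for size in range(1, max_group_size + 1):
        targets.extend(combinations(bboxes, size))
    return targets


def build_pairs(
    bboxes: List[BBox],
    describe: Callable[[Tuple[BBox, ...]], str],
    max_group_size: int = 2,
) -> List[TrainingPair]:
    """Pair each target group with an instruction from `describe`.

    `describe` is a placeholder for the foundation-model step; here it can be
    any function that maps a group of boxes to a text instruction.
    """
    return [
        TrainingPair(instruction=describe(group), bboxes=group)
        for group in enumerate_targets(bboxes, max_group_size)
    ]
```

With three boxes and `max_group_size=2`, this enumeration yields three single-object targets and three two-object targets. The cap on group size is an assumption added here to keep the number of combinations manageable; the abstract does not say how groups are selected.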

Code & Models

Repositories

jyfenggogo/instructdet
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling