InstructDET: Diversifying Referring Object Detection with Generalized Instructions
Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song, Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song

TL;DR
InstructDET introduces a data-centric approach for referring object detection by generating diversified instructions using foundation models, significantly enhancing detection performance and generalization across datasets.
Contribution
The paper presents InstructDET, a novel method that leverages foundation models to generate diverse instructions for training, expanding existing datasets and improving referring object detection.
Findings
Outperforms existing methods on standard REC datasets.
Demonstrates effective instruction generation using foundation models.
Enables generalization to new detection instructions.
Abstract
We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and different combinations of multiple objects. Each instruction and its corresponding object bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions, we involve emerging vision-language model (VLM) and large language model (LLM) to generate instructions guided by text prompts and object bbxs, as the generalizations of foundation models are effective to produce human-like expressions (e.g., describing object property, category, and relationship). We…
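The training-pair structure described in the abstract (one instruction paired with the bounding boxes of one or more objects) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the names `TrainingPair`, `placeholder_instruction`, and `build_pairs`, the `(x_min, y_min, x_max, y_max)` box convention, and the exhaustive enumeration of object groups are all assumptions made for the example. In the paper, instructions are generated by a VLM and an LLM; here a fixed template stands in for that step.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TrainingPair:
    """One instruction together with the boxes of the objects it refers to."""
    instruction: str
    bboxes: Tuple[BBox, ...]


def placeholder_instruction(labels: Sequence[str]) -> str:
    # Hypothetical stand-in for VLM/LLM instruction generation.
    return "detect " + " and ".join(labels)


def build_pairs(objects: Sequence[Tuple[str, BBox]],
                max_group_size: int = 2) -> List[TrainingPair]:
    """Create one pair per single object and per multi-object combination."""
    pairs: List[TrainingPair] = []
    for size in range(1, max_group_size + 1):
        for group in combinations(objects, size):
            labels = [label for label, _ in group]
            boxes = tuple(box for _, box in group)
            pairs.append(TrainingPair(placeholder_instruction(labels), boxes))
    return pairs


if __name__ == "__main__":
    image_objects = [
        ("dog", (10.0, 20.0, 110.0, 220.0)),
        ("ball", (150.0, 200.0, 180.0, 230.0)),
        ("person", (200.0, 5.0, 320.0, 400.0)),
    ]
    for pair in build_pairs(image_objects):
        print(pair.instruction, pair.bboxes)
```

With three annotated objects and `max_group_size=2`, the sketch yields three single-object pairs and three two-object pairs. A real system would likely filter or sample the groups rather than enumerate them all, since the number of combinations grows quickly with the number of objects.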
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
