LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Shenghao Fu; Qize Yang; Qijie Mo; Junkai Yan; Xihan Wei; Jingke Meng; Xiaohua Xie; Wei-Shi Zheng

arXiv:2501.18954·cs.CV·February 3, 2025·2 cites

TL;DR

This paper introduces LLMDet, an open-vocabulary object detector co-trained with a large language model to generate detailed image captions, which improves detection performance and strengthens multi-modal capabilities.

Contribution

The work contributes the GroundingCap-1M dataset, in which each image carries grounding labels and a detailed image-level caption, together with a method for co-training an open-vocabulary detector with a large language model.

Findings

01 LLMDet outperforms the baseline detector on open-vocabulary detection.

02 Generated captions improve detection accuracy and generalization.

03 Detection and multi-modal modeling enhance each other.

Abstract

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating a detailed image-level caption for each image, can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear…
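
The training setup described in the abstract (a grounding loss plus a caption generation loss over region-level and image-level captions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout `GroundingCapSample`, the box format, the helper names, and the loss weights `w_image` and `w_region` are all hypothetical, and the grounding loss is treated as an already-computed number.

```python
import math
from dataclasses import dataclass


@dataclass
class GroundingCapSample:
    """Hypothetical record: one image, its grounding labels, one detailed caption."""
    image_path: str
    boxes: list[tuple[float, float, float, float]]  # assumed (x1, y1, x2, y2) format
    phrases: list[str]                               # one grounded phrase per box
    detailed_caption: str                            # image-level long caption


def caption_nll(token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the reference caption tokens.

    token_probs[i] is the probability the language model assigned to the
    i-th reference token (standard next-token cross-entropy).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def total_loss(
    grounding_loss: float,
    image_token_probs: list[float],
    region_token_probs: list[list[float]],
    w_image: float = 1.0,   # assumed weight, not taken from the paper
    w_region: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Grounding loss plus weighted image-level and region-level caption losses."""
    image_loss = caption_nll(image_token_probs)
    # Average the short-caption loss over regions; zero if no region is captioned.
    if region_token_probs:
        region_loss = sum(caption_nll(p) for p in region_token_probs) / len(region_token_probs)
    else:
        region_loss = 0.0
    return grounding_loss + w_image * image_loss + w_region * region_loss


sample = GroundingCapSample(
    image_path="example.jpg",  # placeholder path
    boxes=[(10.0, 20.0, 110.0, 220.0)],
    phrases=["a brown dog"],
    detailed_caption="A brown dog sits on a wooden porch next to a red ball.",
)
loss = total_loss(
    grounding_loss=0.5,
    image_token_probs=[0.9, 0.8, 0.7],
    region_token_probs=[[0.6, 0.5]],
)
```

In a real training loop the token probabilities would come from the language model conditioned on the detector's features, so gradients from both caption terms flow back into the detector; the sketch above only shows how the terms combine.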

Code & Models

Repositories

isee-laboratory/llmdet (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications