LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng

TL;DR
This paper introduces LLMDet, a novel open-vocabulary object detector trained with detailed image captions generated by a large language model, leading to improved detection performance and enhanced multi-modal capabilities.
Contribution
The work presents the GroundingCap-1M dataset, in which each image is paired with grounding labels and an image-level detailed caption, and a method for co-training an open-vocabulary detector with a large language model.
Findings
LLMDet outperforms baseline detectors in open-vocabulary tasks.
Generated captions improve detection accuracy and generalization.
Detection and multi-modal modeling enhance each other.
Abstract
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear…
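The abstract describes a training objective that combines a standard grounding loss with caption generation losses at the region and image level. The sketch below shows one way such a combination could look. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `caption_weight` parameter, and the simple weighted sum are all hypothetical, and the grounding loss is treated as an already-computed number.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[Sequence[float]],
                        targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens under softmax(logits).

    This is the usual language-modeling loss, used here as a stand-in for a
    caption generation loss.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def combined_loss(grounding_loss: float,
                  image_caption_loss: float,
                  region_caption_loss: float,
                  caption_weight: float = 1.0) -> float:
    """Hypothetical weighted sum of the grounding and caption losses.

    The weighting scheme is an assumption for illustration; the source does
    not state how the terms are balanced.
    """
    return grounding_loss + caption_weight * (image_caption_loss + region_caption_loss)


# Toy usage: one long image-level caption and one short region-level caption,
# each scored over a 4-token vocabulary.
image_loss = token_cross_entropy([[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1]], [0, 1])
region_loss = token_cross_entropy([[0.1, 0.1, 2.0, 0.1]], [2])
total = combined_loss(grounding_loss=1.0,
                      image_caption_loss=image_loss,
                      region_caption_loss=region_loss)
```

In a real training setup the caption logits would come from the language model conditioned on the detector's features, so both loss terms backpropagate into the detector.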
Code & Models
- 🤗 fushh7/LLMDet · model · ♡ 23
- 🤗 fushh7/llmdet_swin_tiny_hf · model · 2.2k dl · ♡ 3
- 🤗 fushh7/llmdet_swin_base_hf · model · 15 dl
- 🤗 fushh7/llmdet_swin_large_hf · model · 469 dl · ♡ 6
- 🤗 iSEE-Laboratory/llmdet_tiny · model · 535 dl · ♡ 6
- 🤗 iSEE-Laboratory/llmdet_base · model · 509k dl · ♡ 9
- 🤗 iSEE-Laboratory/llmdet_large · model · 8.5k dl · ♡ 17
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
