Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

TL;DR
Grounding DINO is an open-set object detector that combines the Transformer-based detector DINO with grounded pre-training, so that it can detect arbitrary objects specified by human inputs such as category names or referring expressions.
Contribution
The paper proposes a new open-set object detector that integrates language and vision modalities through a tight fusion approach, extending detection capabilities to arbitrary objects and referring expressions.
Findings
Achieves 52.5 AP on the COCO zero-shot transfer benchmark.
Sets a new record of 26.1 AP on the ODinW zero-shot benchmark.
Performs well on benchmarks including COCO, LVIS, ODinW, and RefCOCO/+/g.
Abstract
In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably…
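The abstract names language-guided query selection as one of the three fusion components. The sketch below illustrates one way such a selection could work, assuming each image token is scored by its highest dot-product similarity to any text token and the top-scoring tokens become decoder queries. The function name, the toy 2-D features, and the `num_queries` parameter are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def language_guided_query_selection(
    image_tokens: List[Vector],
    text_tokens: List[Vector],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Hypothetical helper: scores each image token by its maximum
    similarity over all text tokens, then keeps the top `num_queries`.
    """
    if not text_tokens:
        raise ValueError("need at least one text token")
    # One score per image token: the best match against the text.
    scores = [max(dot(img, txt) for txt in text_tokens) for img in image_tokens]
    # Rank image-token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy features (assumed values, for illustration only).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_tokens = [[1.0, 0.0], [0.9, 0.1]]
selected = language_guided_query_selection(image_tokens, text_tokens, num_queries=2)
print(selected)
```

In this toy setup the image tokens that align best with the text features are chosen first, which is the intuition behind letting the language input guide where the decoder looks.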
Code & Models
- 🤗 IDEA-Research/grounding-dino-base (model) · 1.7M downloads · ♡ 173
- 🤗 ShilongLiu/GroundingDINO (model) · ♡ 161
- 🤗 camenduru/GroundingDINO (model) · ♡ 2
- 🤗 IDEA-Research/grounding-dino-tiny (model) · 628k downloads · ♡ 97
- 🤗 mart9992/eri2 (model)
- 🤗 mart9992/nervn (model) · ♡ 5
- 🤗 mart9992/vierundvi (model)
- 🤗 kelvinou01/GroundingDINO (model) · ♡ 1
- 🤗 sheshkar/cmon2 (model) · 11 downloads
- 🤗 m522t/open_groundingdino (model) · 13 downloads
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer
