Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu; Zhaoyang Zeng; Tianhe Ren; Feng Li; Hao Zhang; Jie Yang; Qing Jiang; Chunyuan Li; Jianwei Yang; Hang Su; Jun Zhu; Lei Zhang

arXiv:2303.05499·cs.CV·July 22, 2024·243 cites

TL;DR

Grounding DINO is an open-set object detection framework that combines the Transformer-based detector DINO with grounded pre-training. It detects arbitrary objects specified by human inputs such as category names or referring expressions, and is evaluated on multiple benchmarks.

Contribution

The paper proposes a new open-set object detector that integrates language and vision modalities through a tight fusion approach, extending detection capabilities to arbitrary objects and referring expressions.

Findings

01. Achieves 52.5 AP on the COCO zero-shot transfer benchmark.

02. Sets a new record on the ODinW zero-shot benchmark with a mean AP of 26.1.

03. Is evaluated on COCO, LVIS, ODinW, and RefCOCO/+/g, and performs well on each.

Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably…
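The abstract names a language-guided query selection step but does not describe its mechanics. A minimal sketch of one plausible reading follows: score each image token by its best dot-product match against any text token, then keep the highest-scoring tokens as decoder queries. The function name, the plain-list feature format, and the toy inputs are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def language_guided_query_selection(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Hypothetical helper: assumes image and text features already share
    one embedding dimension (e.g. after a feature enhancer).
    """
    # For each image token, keep its best similarity over all text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank image tokens by that score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: four image tokens, two text tokens, two-dimensional features.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.25], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
selected = language_guided_query_selection(image_feats, text_feats, num_queries=2)
print(selected)
```

In this reading, the selected indices would pick out the image features used to initialize the queries passed to the cross-modality decoder; a real implementation would operate on batched tensors rather than Python lists.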

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer