GRiT: A Generative Region-to-text Transformer for Object Understanding

Jialian Wu; Jianfeng Wang; Zhengyuan Yang; Zhe Gan; Zicheng Liu; Junsong Yuan; Lijuan Wang

arXiv:2212.00280 · cs.CV · December 2, 2022 · 30 citations

TL;DR

GRiT is a transformer-based model that jointly performs object detection and dense captioning by generating descriptive text for object regions, enabling richer object understanding.

Contribution

The paper introduces a unified generative framework that models object understanding as <region, text> pairs. The same architecture can produce class labels for object detection or descriptive sentences for dense captioning.

Findings

01. Achieves 60.4 AP on COCO 2017 test-dev for object detection

02. Attains 15.5 mAP on Visual Genome for dense captioning

03. Handles detection and dense captioning with the same model architecture

Abstract

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense…
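
The three components named in the abstract can be pictured as a simple pipeline that emits <region, text> pairs. The following is a minimal sketch, not the authors' implementation: `RegionText`, `understand_objects`, and all the `toy_*` functions are hypothetical names introduced here for illustration, and the toy stand-ins return fixed values rather than running any real model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class RegionText:
    """One <region, text> pair: where an object is and how it is described."""
    box: Box
    text: str


def understand_objects(
    image: Any,
    encode: Callable[[Any], Any],
    extract_foreground: Callable[[Any], Iterable[Tuple[Box, Any]]],
    decode_text: Callable[[Any], str],
) -> List[RegionText]:
    """Chain the three stages: encoder -> object extractor -> text decoder."""
    features = encode(image)
    return [
        RegionText(box=box, text=decode_text(region_features))
        for box, region_features in extract_foreground(features)
    ]


# Toy stand-ins (assumptions for illustration only, no learning involved).
def toy_encode(image):
    # Pretend the "features" are just the image size.
    return {"size": (len(image[0]), len(image))}


def toy_extract(features):
    # Pretend one object occupies the top-left quarter of the image.
    width, height = features["size"]
    return [((0.0, 0.0, width / 2, height / 2), "region-features")]


def toy_noun_decoder(region_features):
    # Detection-style text: a class name.
    return "dog"


def toy_sentence_decoder(region_features):
    # Dense-captioning-style text: a descriptive sentence.
    return "a brown dog lying on the grass"


image = [[0] * 4 for _ in range(4)]  # dummy 4x4 "image"
detections = understand_objects(image, toy_encode, toy_extract, toy_noun_decoder)
captions = understand_objects(image, toy_encode, toy_extract, toy_sentence_decoder)
```

In this sketch, only the text produced for each region differs between the two tasks; the region side of the pair and the overall data flow stay the same, which mirrors the <region, text> formulation described in the abstract.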

Code & Models

Repositories

JialianW/GRiT (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling