GLIPv2: Unifying Localization and Vision-Language Understanding

Haotian Zhang; Pengchuan Zhang; Xiaowei Hu; Yen-Chun Chen; Liunian Harold Li; Xiyang Dai; Lijuan Wang; Lu Yuan; Jenq-Neng Hwang; Jianfeng Gao

arXiv:2206.05836 · cs.CV · October 13, 2022 · 126 citations

TL;DR

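GLIPv2 is a unified vision-language model that serves both localization and understanding tasks. It is pre-trained with three tasks (phrase grounding, region-word contrastive learning, and masked language modeling) and achieves near state-of-the-art results along with strong zero-shot and few-shot capabilities.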

Contribution

It introduces a unified pre-training framework that integrates localization and vision-language understanding tasks, simplifying the previous multi-stage VLP procedure and yielding mutual benefits between the two kinds of task.

Findings

01 Achieves near state-of-the-art on multiple tasks

02 Demonstrates strong zero-shot and few-shot detection

03 Shows superior grounding capabilities

Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot…
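To make the first two pre-training tasks more concrete, here is a minimal NumPy sketch of region-word alignment. It is not the authors' implementation. It assumes alignment scores are plain dot products between region and word features, uses a simple binary cross-entropy where the real model may use a different loss, and assumes that region-word contrastive learning treats words from other images in the batch as negatives, which the abstract does not spell out. All function names are hypothetical.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw scores (numerically stable form)."""
    return float(np.mean(
        np.maximum(logits, 0.0) - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    ))


def intra_image_loss(regions, words, targets):
    """Phrase-grounding-style loss for a single image (assumed form).

    regions: (N, d) region features; words: (M, d) word features;
    targets: (N, M) binary matrix, 1 where a region matches a word.
    """
    scores = regions @ words.T  # (N, M) region-word alignment scores
    return bce_with_logits(scores, targets)


def block_diagonal_targets(batch_targets):
    """Place each image's (N, M) target on the diagonal of a (B*N, B*M) matrix."""
    B, N, M = batch_targets.shape
    big = np.zeros((B * N, B * M))
    for b in range(B):
        big[b * N:(b + 1) * N, b * M:(b + 1) * M] = batch_targets[b]
    return big


def inter_image_loss(batch_regions, batch_words, batch_targets):
    """Batch-level region-word contrast (assumed form).

    Every region is scored against the words of every image in the batch;
    off-diagonal blocks of the target are zero, so words from other images
    act as negatives. This simplification ignores the case where the same
    phrase appears in two different images.
    """
    B, N, d = batch_regions.shape
    M = batch_words.shape[1]
    scores = batch_regions.reshape(B * N, d) @ batch_words.reshape(B * M, d).T
    return bce_with_logits(scores, block_diagonal_targets(batch_targets))
```

The only difference between the two losses in this sketch is the pool of negatives: the intra-image loss contrasts a region with the words of its own caption, while the batch-level loss enlarges that pool to every caption in the batch.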

Code & Models

Repositories

microsoft/GLIP (PyTorch, official)

Videos

GLIPv2: Unifying Localization and Vision-Language Understanding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Learning