GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang; Zhengyuan Yang; Xiaowei Hu; Linjie Li; Kevin Lin; Zhe Gan; Zicheng Liu; Ce Liu; Lijuan Wang

arXiv:2205.14100 · cs.CV · December 19, 2022 · 208 cites

TL;DR

GIT is a simplified, large-scale generative image-to-text transformer that unifies vision-language tasks such as image/video captioning and question answering. It reaches state-of-the-art results on 12 benchmarks without external modules such as object detectors, taggers, or OCR.

Contribution

The paper introduces GIT, a generative image-to-text transformer that reduces the architecture to one image encoder and one text decoder trained under a single language modeling task, and that scales up both the pre-training data and the model size.

Findings

01 Surpasses human performance on the TextCaps dataset (138.2 vs. 125.5).

02 Achieves state-of-the-art results on 12 benchmarks.

03 Introduces a new generation-based image classification scheme.
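The generation-based classification scheme in the third finding is not described further on this page. As a rough illustration only, the sketch below assumes that the decoder generates the class name as ordinary text and that the generated string is then compared with the known class names. All names here (`greedy_generate`, `classify_by_generation`, `toy_decoder`) are hypothetical, and `toy_decoder` is a scripted stand-in for an image-conditioned text decoder, not the GIT model.

```python
from typing import Callable, Dict, List, Optional, Set

EOS = "<eos>"  # hypothetical end-of-sequence token

# A decoder step maps the tokens generated so far to next-token probabilities.
DecoderStep = Callable[[List[str]], Dict[str, float]]


def greedy_generate(next_token_probs: DecoderStep, max_len: int = 8) -> List[str]:
    """Pick the most probable token at each step until EOS or max_len."""
    tokens: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def classify_by_generation(
    next_token_probs: DecoderStep, class_names: Set[str]
) -> Optional[str]:
    """Treat the generated text as the label; return None if it is not a known class."""
    text = " ".join(greedy_generate(next_token_probs))
    return text if text in class_names else None


def toy_decoder(prefix: List[str]) -> Dict[str, float]:
    """Scripted stand-in for a real decoder: always spells out 'golden retriever'."""
    script = ["golden", "retriever", EOS]
    target = script[min(len(prefix), len(script) - 1)]
    return {target: 0.9, "cat": 0.1}


label = classify_by_generation(toy_decoder, {"golden retriever", "tabby cat"})
print(label)
```

In this framing the label set only appears at the matching step, so the same decoder could in principle be reused across label sets; whether the paper constrains decoding to valid class names is not stated here.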

Abstract

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in…
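The "one image encoder and one text decoder under a single language modeling task" design can be pictured with a small sketch. It assumes, beyond what the abstract states, that image features are prepended to the text tokens as a prefix, that text positions attend to the whole image prefix plus earlier text, and that the loss is next-token negative log-likelihood over text positions only. The helpers `build_prefix_mask` and `lm_loss` are hypothetical names used for illustration.

```python
import math
from typing import Dict, List


def build_prefix_mask(num_image: int, num_text: int) -> List[List[int]]:
    """Return mask where mask[i][j] == 1 means position i may attend to position j.

    Positions 0..num_image-1 are image tokens; the rest are text tokens.
    """
    total = num_image + num_text
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image:
                # Every position (image or text) sees the full image prefix.
                mask[i][j] = 1
            elif i >= num_image and j <= i:
                # Text positions see text at or before themselves (causal).
                mask[i][j] = 1
    return mask


def lm_loss(step_probs: List[Dict[str, float]], targets: List[str]) -> float:
    """Average negative log-likelihood of the target text tokens."""
    nll = [-math.log(probs[tok]) for probs, tok in zip(step_probs, targets)]
    return sum(nll) / len(nll)


# Two image tokens followed by three text tokens.
for row in build_prefix_mask(2, 3):
    print(row)

# Toy predicted probabilities for a two-token caption.
print(lm_loss([{"a": 0.5}, {"b": 0.25}], ["a", "b"]))
```

Because the same objective is used at every stage, pre-training and fine-tuning differ only in the data fed to it, which is the consistency the abstract refers to.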

Code & Models

Repositories

microsoft/GenerativeImage2Text (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention