GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

TL;DR
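GIT is a simplified, large-scale generative image-to-text transformer that unifies vision-language tasks such as image/video captioning and question answering. It sets new state-of-the-art results on 12 benchmarks without complex encoder/decoder structures or external modules such as object detectors and OCR.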
Contribution
The paper introduces GIT, a generative image-to-text transformer that reduces the architecture to one image encoder and one text decoder trained under a single language modeling task, and scales up both the pre-training data and the model size to improve performance.
Findings
Surpasses human performance on the TextCaps dataset for the first time (138.2 vs. 125.5).
Achieves state-of-the-art results on 12 benchmarks, by a large margin.
Introduces a new generation-based image classification scheme.
Abstract
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in…
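The "single language modeling task" in the abstract can be illustrated with a toy loss computation. The excerpt here does not spell out the loss, so the sketch below assumes the usual autoregressive objective: the mean cross-entropy of each ground-truth caption token under the decoder's predicted distribution, where every prediction is conditioned on the image and the preceding tokens. The function name `lm_loss`, the toy vocabulary, and the probability values are all hypothetical and exist only for illustration.

```python
import math


def lm_loss(step_probs, target_ids):
    """Mean cross-entropy of the target tokens (assumed standard LM objective).

    step_probs[i] stands in for a decoder output: a probability distribution
    over the vocabulary at position i, conditioned on the image and on the
    tokens before position i. target_ids[i] is the ground-truth token id.
    """
    if not target_ids:
        raise ValueError("target_ids must not be empty")
    if len(step_probs) != len(target_ids):
        raise ValueError("need exactly one distribution per target token")
    # Negative log-likelihood of each ground-truth token.
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


# Hypothetical toy vocabulary: 0=[BOS], 1="a", 2="cat", 3=[EOS].
targets = [1, 2, 3]  # caption "a cat" followed by [EOS]
confident = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
uniform = [[0.25] * 4 for _ in targets]

print(lm_loss(confident, targets))  # lower loss: targets get probability 0.7
print(lm_loss(uniform, targets))    # higher loss: targets get probability 0.25
```

Because the same objective applies whenever the target is a token sequence, captioning and question answering can share one training setup, with only the text to be generated changing.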
Code & Models
- 🤗 microsoft/git-base · model · 12k dl · ♡ 106
- 🤗 microsoft/git-base-coco · model · 4.0k dl · ♡ 20
- 🤗 microsoft/git-base-textcaps · model · 262 dl · ♡ 8
- 🤗 microsoft/git-base-vqav2 · model · 138 dl · ♡ 20
- 🤗 microsoft/git-base-textvqa · model · 611 dl · ♡ 6
- 🤗 microsoft/git-large · model · 589 dl · ♡ 17
- 🤗 microsoft/git-base-vatex · model · 132 dl · ♡ 4
- 🤗 microsoft/git-large-coco · model · 7.8k dl · ♡ 104
- 🤗 microsoft/git-large-textcaps · model · 435 dl · ♡ 30
- 🤗 microsoft/git-base-msrvtt-qa · model · 23 dl · ♡ 2
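As a rough sketch of how one of the checkpoints listed above might be used for captioning, the snippet below loads `microsoft/git-base-coco` through the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `Pillow` are installed and that the weights can be downloaded. The helper name `caption_image` and the `max_length=50` default are illustrative choices, not settings taken from the paper.

```python
CHECKPOINT = "microsoft/git-base-coco"  # one of the checkpoints listed above


def caption_image(image_path, checkpoint=CHECKPOINT, max_length=50):
    """Generate one caption for the image stored at image_path.

    Hypothetical helper for illustration; downloads the weights on first use.
    """
    # Imported lazily so that defining the function does not require the
    # heavy dependencies to be installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # The processor resizes and normalizes the image into model inputs.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # The text decoder generates caption tokens conditioned on the image.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `caption_image("photo.jpg")` (with `photo.jpg` standing in for any local image) should return a single caption string. Passing `checkpoint="microsoft/git-large-coco"` would be the natural way to try the larger captioning model from the same list.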
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Multi-Head Attention
