Pre-training image-language transformers for open-vocabulary tasks

AJ Piergiovanni; Weicheng Kuo; Anelia Angelova

arXiv:2209.04372·cs.CV·September 12, 2022

TL;DR

This paper introduces a pre-training method for vision-language transformers using diverse tasks and data, leading to significant improvements in open-vocabulary vision-language tasks like VQA, captioning, and visual entailment.

Contribution

It proposes a pre-training approach that combines image-text captioning data, which requires no additional supervision, with object-aware strategies, improving performance on multiple vision-language tasks.

Findings

01. Large gains over standard pre-training methods
02. Effective use of image-text captioning data without additional supervision
03. Improved performance on VQA, captioning, and visual entailment tasks

Abstract

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, and object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling