Pre-training image-language transformers for open-vocabulary tasks
AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

TL;DR
This paper introduces a pre-training method for vision-language transformers based on a mixture of diverse tasks and data, yielding large gains over standard pre-training on open-vocabulary, text-generative tasks such as VQA, captioning, and visual entailment.
Contribution
It proposes a pre-training approach that combines image-text captioning data, which needs no additional supervision, with object-aware pre-training strategies, improving performance on multiple vision-language tasks.
Findings
Large gains over standard pre-training methods
Effective use of image-text captioning data without additional supervision
Improved performance on VQA, captioning, and visual entailment tasks
Abstract
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
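The abstract describes pre-training on "a mixture of diverse tasks" without listing the tasks or how they are weighted. As a rough illustration only, one common way to realize such a mixture is to draw a task for each training step in proportion to a weight. The task names, weights, and the function `sample_task_schedule` below are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical task names and weights (assumptions for illustration;
# the abstract does not specify the actual mixture).
TASK_WEIGHTS = {
    "captioning": 0.5,
    "object_aware": 0.5,
}


def sample_task_schedule(task_weights, num_steps, seed=0):
    """Return one task name per training step, drawn in proportion to its weight."""
    rng = random.Random(seed)  # local RNG so the schedule is reproducible
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_steps)


# Build a schedule and count how often each task was selected.
schedule = sample_task_schedule(TASK_WEIGHTS, num_steps=1000)
counts = Counter(schedule)
```

In a real training loop, each entry of `schedule` would select which dataset and objective supplies the batch for that step. Changing the weights shifts how often each task is seen without touching the rest of the loop.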
Code & Models
- 🤗 google/paligemma-3b-pt-224 (model) · 86k dl · ♡ 426
- 🤗 google/paligemma-3b-mix-448 (model) · 2.9k dl · ♡ 116
- 🤗 google/paligemma-3b-pt-224-jax (model) · 205 dl · ♡ 3
- 🤗 google/paligemma-3b-pt-448-jax (model) · 2 dl · ♡ 2
- 🤗 google/paligemma-3b-pt-896-jax (model) · ♡ 2
- 🤗 google/paligemma-3b-ft-aokvqa-mc-448-jax (model)
- 🤗 google/paligemma-3b-ft-textcaps-224-jax (model)
- 🤗 google/paligemma-3b-ft-widgetcap-448-jax (model) · ♡ 2
- 🤗 google/paligemma-3b-ft-vqav2-448-jax (model) · 1 dl · ♡ 2
- 🤗 google/paligemma-3b-ft-refcoco-seg-448-jax (model) · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
