PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

TL;DR
PaLI-3 is a compact, efficient vision language model (VLM) that compares favorably to models 10x its size on multimodal tasks, especially localization and visually-situated text understanding, by using a contrastively pretrained SigLIP image encoder.
Contribution
Introducing PaLI-3, a smaller and faster VLM that uses contrastive SigLIP pretraining, achieving state-of-the-art results with only 5 billion parameters.
Findings
SigLIP-based PaLI outperforms classification-pretrained models in multimodal benchmarks.
Scaling SigLIP to 2 billion parameters improves multilingual cross-modal retrieval.
PaLI-3, at only 5B parameters, compares favorably to similar models that are 10x larger.
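Cross-modal retrieval results like the one in the findings above are commonly scored with recall@k, although this page does not name the metric used. As a minimal sketch under that assumption, the function below computes recall@k from a query-by-candidate similarity matrix in which the correct candidate for query i is candidate i. The function name `recall_at_k` and the toy scores are hypothetical and chosen only for illustration.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i (e.g. a caption) and
    candidate j (e.g. an image). Assumption for this sketch: the correct
    candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own candidate first, query 1 does not.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
```

In practice the similarity matrix would come from dot products between normalized text and image embeddings, and the same function can be applied in both directions (text-to-image and image-to-text) by transposing the matrix.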
Abstract
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
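The abstract contrasts classification pretraining with contrastive (SigLIP) pretraining but does not spell out the objective. As a rough illustration, the sketch below computes a pairwise sigmoid image-text loss in dependency-free Python: every image-text pair in a batch is treated as an independent binary decision, positive when the pair matches and negative otherwise. The function names, the toy embeddings, and the default temperature and bias values are assumptions made for this example and are not taken from this page; this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    Assumptions for this sketch: image_embs[i] and text_embs[i] form the
    matching pair, and temperature/bias are fixed constants here (in a real
    training setup they would typically be learnable parameters).
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            total += log_sigmoid(label * logit)
    # Average the negative log-likelihood over the batch size.
    return -total / len(images)


# Toy batch: two orthogonal embeddings, once paired correctly and once swapped.
embs = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = sigmoid_pairwise_loss(embs, embs)
loss_swapped = sigmoid_pairwise_loss(embs, [[0.0, 1.0], [1.0, 0.0]])
```

Unlike a softmax-based contrastive loss, each pair here contributes its own term, so no normalization across the whole batch is needed. On the toy batch, the correctly paired embeddings give a lower loss than the swapped ones.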
Peer Reviews
Decision · Submitted to ICLR 2024
- The conclusion that a contrastively pretrained visual encoder can outperform a classification-pretrained encoder in vision-language tasks, particularly in grounding, is valuable and beneficial to the vision-language community.
- Strong performance with far fewer parameters.
- Sufficient in-depth analysis of general tasks, fairness, bias, and potential issues is performed to better understand the model.
- The main weakness of PaLI-3, from my perspective, is how the authors draw their conclusion. Specifically, because SigLIP shows better performance than the classification-pretrained visual encoder used by PaLI and PaLI-X, the authors conclude that a contrastively pretrained visual encoder is superior to a classification-pretrained one. However, it's worth noting that most of the accessible contrastively pretrained visual encoders for the vision and vision-language co
The main strength of the paper is the numerous experiments the authors have carried out and the good results presented. Moreover, the paper is fairly easy to follow.
Unfortunately, I don't believe that the claimed contributions (use of SigLIP and increase of resolution) are enough for ICLR. The finding that a contrastively pre-trained visual backbone with language supervision works better than one trained for classification doesn't seem very surprising. Moreover, training follows previous PaLI training pipelines, so there is no particular novelty in this regard either. Actually, incorporating these improvements could probably benefit any other model compared with the propos
1) Very good results in terms of cost-effectiveness trade-off. Comprehensive evaluation on various benchmarks.
2) The paper is very easy to read and understand.
3) The approach is simple and easy to implement.
4) The effectiveness of SigLIP is very insightful. It seems that such a simple modification can give a significant improvement. It shows the importance of designing a smarter training objective that aligns better with the language models.
1) I strongly encourage the authors to provide more comparisons between CLIP and SigLIP under this paper's setting. The current ablation only includes the comparison between SigLIP and vanilla classification.
2) I understand the paper mainly focuses on a smaller and cheaper model, as stated in the title. However, I think it is important to study the scaling results to check the effectiveness on a larger scale. Can SigLIP still be as effective when using a larger vision encoder and language model
Code & Models
- 🤗 google/paligemma-3b-pt-224 · model · 86k dl · ♡ 426
- 🤗 google/paligemma-3b-mix-448 · model · 2.9k dl · ♡ 116
- 🤗 google/paligemma-3b-pt-224-jax · model · 205 dl · ♡ 3
- 🤗 google/paligemma-3b-pt-448-jax · model · 2 dl · ♡ 2
- 🤗 google/paligemma-3b-pt-896-jax · model · ♡ 2
- 🤗 google/paligemma-3b-ft-aokvqa-mc-448-jax · model
- 🤗 google/paligemma-3b-ft-textcaps-224-jax · model
- 🤗 google/paligemma-3b-ft-widgetcap-448-jax · model · ♡ 2
- 🤗 google/paligemma-3b-ft-vqav2-448-jax · model · 1 dl · ♡ 2
- 🤗 google/paligemma-3b-ft-refcoco-seg-448-jax · model · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Label Smoothing · Residual Connection · Absolute Position Encodings · Layer Normalization · Dense Connections
