Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

TL;DR
This paper systematically evaluates the design choices behind visually-conditioned language models (VLMs): it provides standardized benchmarks, analyzes the factors that drive performance, and releases improved models and tools.
Contribution
The paper introduces a comprehensive evaluation framework, investigates key design decisions, and releases new VLM checkpoints that outperform existing state-of-the-art models.
Findings
VLMs' performance varies significantly with design choices.
Pretrained visual representations and training strategies impact capabilities.
New VLMs at 7-13B scale outperform previous models.
Abstract
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Balanced Selection
