LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal

TL;DR
LLaVA-Gemma explores the use of a compact 2B parameter Gemma LLM to accelerate multimodal foundation models, analyzing design choices and providing publicly available training resources.
Contribution
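This work introduces LLaVA-Gemma, a suite of multimodal foundation models trained with the LLaVA framework on Gemma LLMs, and evaluates how three design features affect performance: pretraining the connector, using a more powerful image backbone, and increasing the size of the language backbone.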
Findings
Skipping connector pretraining tends to reduce performance
Larger vision models sometimes improve performance
Increasing language model size has inconsistent effects
Abstract
We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent…
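The three ablated design features can be read as a small configuration grid. The following is a minimal sketch of how such a grid might be enumerated, not the authors' training code. The names `AblationConfig` and `ablation_grid` and the backbone labels are hypothetical placeholders, and the abstract does not say that every combination was actually trained.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One hypothetical model variant in the ablation study."""
    pretrain_connector: bool   # whether the connector is pretrained first
    vision_backbone: str       # placeholder label for the image backbone
    language_backbone: str     # placeholder label for the language backbone


def ablation_grid(vision_backbones, language_backbones):
    """Enumerate every combination of the three design features.

    Assumption: a full factorial grid; the paper may cover only a subset.
    """
    return [
        AblationConfig(pretrain, vision, language)
        for pretrain, vision, language in product(
            (True, False), vision_backbones, language_backbones
        )
    ]


# Placeholder labels, chosen for illustration only.
configs = ablation_grid(
    vision_backbones=("base-vision", "larger-vision"),
    language_backbones=("gemma-2b", "larger-gemma"),
)

for cfg in configs:
    print(cfg)
```

Comparing evaluation scores between configurations that differ in exactly one field isolates the effect of that design feature, which is the kind of comparison the abstract summarizes.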
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
