LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Musashi Hinck; Matthew L. Olson; David Cobbley; Shao-Yen Tseng; Vasudev Lal

arXiv:2404.01331·cs.CL·June 12, 2024·1 cites

TL;DR

LLaVA-Gemma explores the use of a compact 2B parameter Gemma LLM to accelerate multimodal foundation models, analyzing design choices and providing publicly available training resources.

Contribution

This work introduces LLaVA-Gemma, a framework combining Gemma LLMs with multimodal models, and evaluates the impact of various design features on performance.

Findings

01 Skipping connector pretraining tends to reduce performance

02 Larger vision models sometimes improve results

03 Increasing language model size has mixed effects

Abstract

We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent…
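
One way to picture the three ablations named in the abstract is as a small configuration grid. The sketch below is illustrative only: the class name, field names, and backbone labels are hypothetical placeholders rather than the authors' code or real checkpoint names, and the paper may not have trained every combination.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One hypothetical training run; all labels are placeholders."""
    pretrain_connector: bool  # False = skip the connector pretraining stage
    vision_backbone: str      # placeholder label, not a real checkpoint name
    language_backbone: str    # placeholder label, not a real checkpoint name


def ablation_grid():
    """Enumerate every combination of the three design features."""
    return [
        AblationConfig(pretrain, vision, language)
        for pretrain, vision, language in product(
            (True, False),
            ("baseline-vision", "larger-vision"),
            ("gemma-2b", "gemma-larger"),
        )
    ]


if __name__ == "__main__":
    for cfg in ablation_grid():
        print(cfg)
```

Treating each design feature as an independent axis makes it easy to compare runs that differ in exactly one choice, which is the comparison the abstract's findings rest on.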

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems