Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

Miao Rang; Zhenni Bi; Chuanjian Liu; Yehui Tang; Kai Han; Yunhe Wang

arXiv:2501.04322 · cs.CV · January 24, 2025

TL;DR

Eve is a versatile, efficient vision-language model with 1.8B parameters that balances multimodal and linguistic capabilities, outperforming larger models in benchmarks and enabling edge device deployment.

Contribution

Eve introduces a novel framework with elastic visual experts that maintains linguistic skills while enhancing multimodal performance in a compact model.

Findings

01. Outperforms larger models in language benchmarks.
02. Achieves 68.87% on VLM benchmarks.
03. Outperforms 7B LLaVA-1.5 in multimodal accuracy.

Abstract

Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations…

Code & Models

Repositories

rangmiao/eve (PyTorch, official)

Videos

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques