Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Wanting Xu; Yang Liu; Langping He; Xucheng Huang; Ling Jiang

arXiv:2405.09215 · cs.CV · June 21, 2024 · 2 cites

TL;DR

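Xmodel-VLM is a lightweight multimodal vision language model built on a 1B-scale language model. It is designed to run efficiently on consumer GPU servers and performs comparably to larger models on classic multimodal benchmarks, lowering the service costs that limit adoption of large-scale multimodal systems.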

Contribution

We present Xmodel-VLM, a novel small-scale multimodal model trained with the LLaVA paradigm, offering high performance with reduced computational requirements.

Findings

01. Achieves performance comparable to larger models on multiple benchmarks.

02. Designed for efficient deployment on consumer GPU servers.

03. Reduces service costs for large-scale multimodal systems.

Abstract

We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
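
The abstract says the model uses the LLaVA paradigm for modal alignment but does not describe the projector. As a rough illustration only, the sketch below shows the general LLaVA-style idea: features from a vision encoder are mapped by a small projector into the language model's embedding space and then placed in the same sequence as the text token embeddings. Every name, dimension, and weight here is hypothetical, the ReLU activation is a stand-in, and nothing in it is taken from the Xmodel-VLM code.

```python
import random

# All sizes below are hypothetical and chosen only to keep the example small.
VISION_DIM = 8    # assumed vision-encoder feature size
LLM_DIM = 6       # assumed language-model embedding size
NUM_PATCHES = 4   # assumed number of image patches

rng = random.Random(0)


def random_matrix(rows, cols):
    """Random values standing in for learned weights or encoder outputs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vectors, weight):
    """Bias-free linear map applied to each vector: (n, d_in) -> (n, d_out)."""
    out_dim = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def relu(vectors):
    """Element-wise ReLU (a stand-in activation, not the paper's choice)."""
    return [[max(0.0, a) for a in v] for v in vectors]


def project(patch_features, w1, w2):
    """Two-layer MLP projector: vision feature space -> LLM embedding space."""
    return linear(relu(linear(patch_features, w1)), w2)


# Stand-ins for vision-encoder output and projector weights.
patch_features = random_matrix(NUM_PATCHES, VISION_DIM)
w1 = random_matrix(VISION_DIM, LLM_DIM)
w2 = random_matrix(LLM_DIM, LLM_DIM)
visual_tokens = project(patch_features, w1, w2)

# Stand-ins for text token embeddings before and after the image placeholder.
prompt_before = random_matrix(3, LLM_DIM)
prompt_after = random_matrix(2, LLM_DIM)

# The language model would consume one sequence of same-width embeddings.
sequence = prompt_before + visual_tokens + prompt_after
print(len(sequence), len(sequence[0]))  # prints: 9 6
```

In this setup the projector is the only component that bridges the two modalities, which is why LLaVA-style training typically begins by fitting it while the vision encoder and language model are held fixed.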

Code & Models

Models

🤗 XiaoduoAILab/Xmodel_VLM (model · 70 downloads · 13 likes)

Taxonomy

Topics: Semantic Web and Ontologies