LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

Yichen Zhu; Minjie Zhu; Ning Liu; Zhicai Ou; Xiaofeng Mou; Jian Tang

arXiv:2401.02330 · cs.CV · February 23, 2024 · 1 citation

TL;DR

LLaVA-Phi demonstrates that a small language model trained on high-quality corpora can hold effective multi-modal dialogues, combining visual and textual understanding with the resource efficiency needed for real-time applications.

Contribution

The paper introduces LLaVA-Phi, a multi-modal assistant built on the small language model Phi-2 (2.7B parameters), which performs well on multi-modal dialogue tasks despite its small parameter count.

Findings

01 Effective multi-modal dialogue with 2.7B parameters

02 Competent performance on visual comprehension benchmarks

03 Enables real-time, resource-efficient multi-modal systems
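
To put the 2.7B-parameter figure in context, the following is a rough back-of-the-envelope sketch of how much memory the language model weights alone would occupy at common numeric precisions. It is not taken from the paper: the function name `weight_memory_gb` and the precision table are illustrative assumptions, and the estimate ignores the vision encoder, the projector, activations, and the KV cache.

```python
# Rough weight-memory estimate for a language model backbone.
# Assumption: storage is simply (number of parameters) x (bytes per parameter).
# The precisions listed are common choices, not settings reported by the paper.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # 2.7e9 is the parameter count quoted for the Phi-2 backbone.
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(2.7e9, precision):.2f} GB")
```

Under these assumptions, half-precision weights for a 2.7B-parameter model come to roughly 5.4 GB, which gives a sense of why a model of this size is a candidate for resource-constrained, real-time settings.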

Abstract

In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It…

Code & Models

Repositories

zhuyiche/llava-phi (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems