LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang

TL;DR
LLaVA-Phi demonstrates that small language models with high-quality training can effectively perform multi-modal dialogues, combining visual and textual understanding with resource efficiency for real-time applications.
Contribution
The paper introduces LLaVA-Phi, a multi-modal assistant leveraging a small language model, Phi-2, to achieve high performance in multi-modal dialogue tasks with fewer parameters.
Findings
Effective multi-modal dialogue with 2.7B parameters
Competent performance on visual comprehension benchmarks
Enables real-time, resource-efficient multi-modal systems
Abstract
In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It…
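To put the 2.7B-parameter figure in context, the following is a rough back-of-the-envelope sketch of how much memory the language-model weights alone would occupy at common numeric precisions. The helper name `weight_memory_gib` and the precision table are illustrative assumptions, not something taken from the paper, and the estimate deliberately ignores the vision encoder, the projector, activations, and the key-value cache.

```python
# Rough weight-memory estimate for a language-model backbone.
# Assumption: memory = parameter count * bytes per parameter (weights only).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Parameter count quoted in the abstract for the Phi-2 backbone.
PHI2_PARAMS = 2.7e9

if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(PHI2_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, half-precision weights come to roughly 5 GiB, which is one way to read the abstract's claim that the model suits resource-constrained, real-time settings; actual runtime memory would be higher.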
Code & Models
- 🤗 RaviNaik/Llava-Phi2 (model) · 121 dl · ♡ 5
- 🤗 GunaKoppula/Llava-Phi2 (model) · 1 dl
- 🤗 Navyabhat/Llava-Phi2 (model) · 10 dl · ♡ 1
- 🤗 marianna13/llava-phi-2-3b (model) · 24 dl · ♡ 14
- 🤗 sujitvasanth/vikhyatk-moondream1old (model)
- 🤗 marianna13/llava-phi-2-3b-siglip (model) · 9 dl · ♡ 3
- 🤗 marianna13/llava-phi-2-3b-GGUF (model) · 53 dl · ♡ 4
- 🤗 kejcao/llava-phi-2-GGUF (model) · 123 dl · ♡ 3
- 🤗 sid819/Llava-Phi2 (model) · 4 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
