Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Minjie Zhu; Yichen Zhu; Xin Liu; Ning Liu; Zhiyuan Xu; Chaomin Shen; Yaxin Peng; Zhicai Ou; Feifei Feng; Jian Tang

arXiv:2403.06199 · cs.CV · March 26, 2024 · 1 citation

TL;DR

Mipha introduces an efficient multimodal small language model that rivals large models in visual understanding tasks without increasing training data, making multimodal AI more accessible and cost-effective.

Contribution

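The paper presents Mipha, an efficient multimodal assistant built on small language models. Its 3B variant, Mipha-3B, outperforms larger MLLMs such as LLaVA-1.5-13B on multiple benchmarks, showing that jointly designing the visual representation, language model, and optimization strategy can deliver strong multimodal performance at lower computational cost.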

Findings

1. Mipha-3B surpasses state-of-the-art large MLLMs, notably LLaVA-1.5-13B, on multiple benchmarks.
2. Mipha achieves this without increasing the volume of training data.
3. The paper provides insights and guidelines for developing strong multimodal small language models.

Abstract

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that…

Code & Models

Repositories

zhuyiche/llava-phi
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems