Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang

TL;DR
Mipha is an efficient multimodal small language model that rivals large models on visual understanding tasks without increasing the volume of training data, making multimodal AI more accessible and cost-effective.
Contribution
The paper presents Mipha, a multimodal small language model whose 3B variant outperforms larger MLLMs such as LLaVA-1.5-13B on multiple benchmarks, showing that strong multimodal performance is achievable with reduced computational requirements.
Findings
Mipha-3B surpasses state-of-the-art large MLLMs on multiple benchmarks.
Without additional training data, Mipha achieves competitive performance.
Provides insights for developing effective multimodal small language models.
Abstract
Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
