Benchmarking Large and Small MLLMs
Xuelu Feng, Yunsheng Li, Dongdong Chen, Mei Gao, Mengchen Liu, Junsong Yuan, Chunming Qiao

TL;DR
This paper systematically benchmarks large and small multimodal language models (MLLMs). It finds that small models can match large ones on some tasks but struggle with complex reasoning, and it uses these results to guide future improvements.
Contribution
It provides a comprehensive evaluation of both large and small MLLMs across multiple capabilities and real-world scenarios, highlighting their strengths and limitations.
Findings
Small MLLMs perform comparably to large models in certain tasks.
Large models excel in complex reasoning and nuanced understanding.
Both model types exhibit common failure cases in specific domains.
Abstract
Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal…
Taxonomy
Topics: Natural Language Processing Techniques
