MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks
Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

TL;DR
This paper introduces MM-InstructEval, a comprehensive zero-shot evaluation framework for assessing multimodal large language models on complex reasoning tasks involving vision and text, with new metrics and extensive benchmarking.
Contribution
The paper presents a novel evaluation framework with innovative metrics for zero-shot assessment of MLLMs on multimodal reasoning, along with extensive benchmarking of 45 models across 16 datasets.
Findings
Identifies key factors affecting multimodal reasoning performance.
Establishes new benchmarks for MLLMs in multimodal tasks.
Provides insights into model architecture and instruction interactions.
Abstract
The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
