MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Xiaocui Yang; Wenfang Wu; Shi Feng; Ming Wang; Daling Wang; Yang Li; Qi Sun; Yifei Zhang; Xiaoming Fu; Soujanya Poria

arXiv:2405.07229 · cs.MM · April 24, 2025

TL;DR

This paper introduces MM-InstructEval, a comprehensive zero-shot evaluation framework for assessing multimodal large language models on complex reasoning tasks involving vision and text, with new metrics and extensive benchmarking.

Contribution

The paper presents a novel evaluation framework with innovative metrics for zero-shot assessment of MLLMs on multimodal reasoning, along with extensive benchmarking of 45 models across 16 datasets.

Findings

01. Identifies key factors affecting multimodal reasoning performance.

02. Establishes new benchmarks for MLLMs in multimodal tasks.

03. Provides insights into model architecture and instruction interactions.

Abstract

The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10…
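The zero-shot setup described in the abstract, in which several models are scored on the same data under different instructions, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `zero_shot_accuracy` and `evaluate_grid`, the callable "models", the instruction templates, and the text-only examples are hypothetical and are not taken from the paper or its repository, and plain accuracy stands in for whatever metrics the framework actually defines.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch only.
Model = Callable[[str], str]   # maps a prompt to an answer string
Example = Tuple[str, str]      # (text context, gold label); images omitted in this toy


def zero_shot_accuracy(model: Model, instruction: str, examples: List[Example]) -> float:
    """Accuracy of one model under one instruction, with no in-context examples."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for text, gold in examples:
        prompt = instruction.format(text=text)
        prediction = model(prompt).strip().lower()
        correct += int(prediction == gold.lower())
    return correct / len(examples)


def evaluate_grid(
    models: Dict[str, Model],
    instructions: Dict[str, str],
    examples: List[Example],
) -> Dict[Tuple[str, str], float]:
    """Score every (model, instruction) pair on the same examples."""
    return {
        (m_name, i_name): zero_shot_accuracy(model, template, examples)
        for (m_name, model), (i_name, template) in product(
            models.items(), instructions.items()
        )
    }


if __name__ == "__main__":
    # Toy stand-ins; a real run would call actual (M)LLMs with image and text inputs.
    examples = [("I love this photo", "positive"), ("This is awful", "negative")]
    models = {
        "always_positive": lambda prompt: "positive",
        "keyword": lambda prompt: "negative" if "awful" in prompt else "positive",
    }
    instructions = {
        "plain": "Sentiment of: {text}\nAnswer:",
        "question": "Is the following positive or negative? {text}",
    }
    for key, acc in sorted(evaluate_grid(models, instructions, examples).items()):
        print(key, acc)
```

Keeping the result keyed by (model, instruction) pairs makes it straightforward to compare one model across instructions or one instruction across models, which is the kind of interaction the Findings section refers to.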

Code & Models

Repositories

declare-lab/MM-InstructEval (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems