MIBench: Evaluating LMMs on Multimodal Interaction

Yu Miao; Zequn Yang; Yake Wei; Ziheng Chen; Haotian Ni; Haodong Duan; Kai Chen; Di Hu

arXiv:2603.13427·cs.CV·March 17, 2026

Open Access 1 Datasets

TL;DR

MIBench is a comprehensive benchmark that evaluates the multimodal interaction capabilities of Large Multimodal Models across various tasks, revealing current limitations and guiding future improvements.

Contribution

The paper introduces MIBench, a new benchmark for assessing multimodal interaction in LMMs across multiple cognitive levels and tasks, highlighting existing model deficiencies.

Findings

01. LMMs' multimodal interaction ability is limited despite scaling.

02. Models are easily distracted by textual modalities when processing vision.

03. Native multimodal models show significant deficits in fundamental interaction abilities.

Abstract

Different multimodal scenarios require a model to integrate and use information across modalities in a specific way, depending on the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). It formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, requiring LMMs to employ the correct form of multimodal interaction to complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint…
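The (con_v, con_t, task) triplet described in the abstract can be pictured as a small record type. The following is a minimal sketch under stated assumptions, not the benchmark's actual schema or scoring: the `aspect` and `answer` fields, the aspect label strings, the image-path representation of `con_v`, and the exact-match rule in `accuracy_by_aspect` are all hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One benchmark item, modeled on the (con_v, con_t, task) triplet."""
    con_v: str   # vision context; assumed here to be a path to an image
    con_t: str   # text context
    task: str    # task prompt posed to the model
    aspect: str  # hypothetical label, e.g. "vision_centric" or "text_centric"
    answer: str  # hypothetical gold answer used for scoring


def accuracy_by_aspect(instances, predictions):
    """Group case-insensitive exact-match accuracy by aspect label.

    Exact match is an assumption of this sketch, not the paper's metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.aspect] += 1
        if pred.strip().lower() == inst.answer.strip().lower():
            correct[inst.aspect] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}


# Toy data, invented purely to exercise the functions above.
toy_instances = [
    Instance("img1.png", "A caption.", "What animal is shown?", "vision_centric", "cat"),
    Instance("img2.png", "A caption.", "What animal is shown?", "vision_centric", "dog"),
    Instance("img3.png", "The count is 7.", "What is the count?", "text_centric", "7"),
]
toy_predictions = [" Cat ", "bird", "7"]
toy_scores = accuracy_by_aspect(toy_instances, toy_predictions)
```

Reporting scores per aspect rather than as a single number mirrors the abstract's framing, in which each aspect probes a different form of interaction between the two contexts.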

Code & Models

Datasets

Resurrect/MIBench
dataset· 55 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling