MIBench: Evaluating LMMs on Multimodal Interaction
Yu Miao, Zequn Yang, Yake Wei, Ziheng Chen, Haotian Ni, Haodong Duan, Kai Chen, Di Hu

TL;DR
MIBench is a comprehensive benchmark that evaluates the multimodal interaction capabilities of Large Multimodal Models across various tasks, revealing current limitations and guiding future improvements.
Contribution
The paper introduces MIBench, a new benchmark for assessing multimodal interaction in LMMs across multiple cognitive levels and tasks, highlighting existing model deficiencies.
Findings
LMMs' multimodal interaction ability is limited despite scaling.
Models are easily distracted by textual input when processing visual information.
Native multimodal models show significant deficits in fundamental interaction abilities.
Abstract
In different multimodal scenarios, a model needs to integrate and utilize information across modalities in a specific way, depending on the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). It formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, requiring LMMs to employ the correct form of multimodal interaction to complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint…
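To make the (con_v, con_t, task) formulation concrete, here is a minimal Python sketch of how one such instance might be represented and scored per aspect. It is not MIBench's actual data format or evaluation code. The field names `aspect` and `answer`, the aspect labels, the exact-match scoring rule, and the example file paths are all assumptions made for illustration; only the triplet itself comes from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    con_v: str   # visual context; assumed here to be a path to an image
    con_t: str   # textual context
    task: str    # task prompt given to the model
    aspect: str  # hypothetical label, e.g. "vision_centric", "text_centric", "joint"
    answer: str  # hypothetical reference answer


def accuracy_by_aspect(instances, predictions):
    """Return {aspect: fraction of case-insensitive exact matches}.

    Exact match is an assumed scoring rule, not the benchmark's metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.aspect] += 1
        if pred.strip().lower() == inst.answer.strip().lower():
            correct[inst.aspect] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}


# Toy data; paths and contents are made up for the example.
instances = [
    Instance("img/001.png", "A caption unrelated to the image.",
             "What color is the car?", "vision_centric", "red"),
    Instance("img/002.png", "The meeting is on Friday.",
             "When is the meeting?", "text_centric", "Friday"),
    Instance("img/003.png", "Another distracting caption.",
             "How many dogs are visible?", "vision_centric", "two"),
]
predictions = ["red", "friday", "three"]

scores = accuracy_by_aspect(instances, predictions)
```

Grouping scores by aspect, rather than reporting a single number, mirrors the abstract's framing that a model may handle one form of interaction well and another poorly.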
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
