InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Xiaotian Han; Quanzeng You; Yongfei Liu; Wentao Chen; Huangjie Zheng; Khalil Mrini; Xudong Lin; Yiqi Wang; Bohan Zhai; Jianbo Yuan; Heng Wang; Hongxia Yang

arXiv:2311.11567 · cs.CV · December 6, 2023 · 1 citation

TL;DR

This paper introduces InfiMM-Eval, a benchmark dataset designed to evaluate complex open-ended reasoning in multi-modal large language models across deductive, abductive, and analogical reasoning tasks, with an emphasis on intermediate reasoning steps.

Contribution

The paper presents a manually curated, multi-step reasoning benchmark for MLLMs that emphasizes complex reasoning and intermediate steps, going beyond the simple yes/no and multiple-choice formats of existing evaluations.

Findings

01 MLLMs show varied performance on complex reasoning tasks.

02 Intermediate reasoning steps improve evaluation accuracy.

03 The benchmark effectively distinguishes the reasoning capabilities of different MLLMs.

Abstract

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN · Focus