MANBench: Is Your Multimodal Model Smarter than Human?

Han Zhou; Qitong Xu; Yiheng Dong; Xin Yang

arXiv:2506.11080·cs.CL·June 16, 2025

TL;DR

MANBench is a comprehensive bilingual benchmark designed to evaluate multimodal models' abilities across diverse tasks, revealing that current models outperform humans in some areas but lag in complex reasoning and cross-modal understanding.

Contribution

Introduces MANBench, a new bilingual benchmark with 1,314 questions across nine tasks, to rigorously compare human and multimodal model performance.

Findings

01 MLLMs excel in knowledge and text-image understanding

02 MLLMs struggle with deep cross-modal reasoning tasks

03 Both humans and models find complex puzzles challenging

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image…

Code & Models

Repositories

micdz/manbench (Official)

Datasets

MANBench/MANBench (dataset, 16 downloads)

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques