MANBench: Is Your Multimodal Model Smarter than Human?
Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang

TL;DR
MANBench is a comprehensive bilingual benchmark designed to evaluate multimodal models' abilities across diverse tasks, revealing that current models outperform humans in some areas but lag in complex reasoning and cross-modal understanding.
Contribution
Introduces MANBench, a new bilingual benchmark with 1,314 questions across nine tasks, to rigorously compare human and multimodal model performance.
Findings
MLLMs excel in knowledge and text-image understanding
MLLMs struggle with deep cross-modal reasoning tasks
Both humans and models find complex puzzles challenging
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
