MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams

Pengfei Zhou; Xiaopeng Peng; Fanrui Zhang; Zhaopan Xu; Jiaxin Ai; Yansheng Qiu; Chuanhao Li; Zhen Li; Ming Li; Yukang Feng; Jianwen Sun; Haoquan Zhang; Zizhen Li; Xiaofeng Mao; Zekai Li; Wangbo Zhao; Kai Wang; Xiaojun Chang; Wenqi Shao; Yang You; and Kaipeng Zhang

arXiv:2508.06851·cs.AI·August 12, 2025

Open Access

TL;DR

MDK12-Bench is a large-scale, multidisciplinary benchmark built from real-world K-12 exams for evaluating multimodal large language models. It addresses the limited scale, narrow coverage, and static evaluation of existing benchmarks, and it tests model robustness under difficulty, temporal, and contextual shifts.

Contribution

Introduces MDK12-Bench, a comprehensive, real-world exam-based benchmark with a dynamic evaluation framework to better assess MLLMs' generalization and reasoning abilities.

Findings

01. Current MLLMs show significant limitations in multiple evaluation dimensions.

02. The benchmark reveals weaknesses in model robustness and generalization.

03. Knowledge-driven reasoning improves problem-solving performance.

Abstract

Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines with 141K instances and 6,225 knowledge points organized in a six-layer taxonomy. Covering five question formats with difficulty and year annotations, it enables comprehensive evaluation to capture the extent to which MLLMs perform over four dimensions: 1) difficulty levels, 2) temporal (cross-year) shifts, 3) contextual shifts, and 4) knowledge-driven reasoning. We propose a novel…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning