R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Meng-Hao Guo; Jiajun Xu; Yi Zhang; Jiaxi Song; Haoyang Peng; Yi-Xuan Deng; Xinzhi Dong; Kiyohiro Nakayama; Zhengyang Geng; Chen Wang; Bolin Ni; Guo-Wei Yang; Yongming Rao; Houwen Peng; Han Hu; Gordon Wetzstein; Shi-min Hu

arXiv:2505.02018 · cs.CV · May 6, 2025

TL;DR

R-Bench is a comprehensive, multi-disciplinary benchmark designed to rigorously evaluate the complex reasoning capabilities of language and multimodal models across multiple subjects and languages, highlighting current models' limitations.

Contribution

The paper introduces R-Bench, a novel graduate-level, multi-disciplinary, bilingual benchmark for assessing complex reasoning in language and multimodal models, with extensive curated questions and cross-linguistic alignment.

Findings

01. Advanced models perform poorly on complex reasoning tasks.

02. Top models achieve only around 53% accuracy on multimodal reasoning.

03. The benchmark reveals significant gaps in current models' reasoning abilities.

Abstract

Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed as Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, enabling the assessment…

Code & Models

Datasets

R-Bench/R-Bench
dataset · 646 dl

Taxonomy

Topics: AI-based Problem Solving and Planning · Semantic Web and Ontologies · Artificial Intelligence in Law