OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu

TL;DR
OlympicArena is a comprehensive benchmark dataset designed to evaluate AI's multi-discipline cognitive reasoning abilities across text and image modalities, revealing current limitations and guiding future superintelligent AI development.
Contribution
The paper introduces OlympicArena, a large-scale, multi-modal benchmark with interdisciplinary problems to evaluate and analyze AI reasoning capabilities comprehensively.
Findings
GPT-4o achieves only 39.97% overall accuracy on the benchmark.
Current AI models show significant limitations in complex, interdisciplinary reasoning.
OlympicArena provides resources for advancing AI research in scientific discovery.
Abstract
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling…
Taxonomy
Topics: AI-based Problem Solving and Planning
Methods: Sparse Evolutionary Training
