OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Zhen Huang; Zengzhi Wang; Shijie Xia; Xuefeng Li; Haoyang Zou; Ruijie Xu; Run-Ze Fan; Lyumanshan Ye; Ethan Chern; Yixin Ye; Yikai Zhang; Yuqing Yang; Ting Wu; Binjie Wang; Shichao Sun; Yang Xiao; Yiyuan Li; Fan Zhou; Steffi Chern; Yiwei Qin; Yan Ma; Jiadi Su; Yixiu Liu; Yuxiang Zheng; Shaoting Zhang; Dahua Lin; Yu Qiao; Pengfei Liu

arXiv:2406.12753 · cs.CL · March 7, 2025 · 2 citations

TL;DR

OlympicArena is a comprehensive benchmark dataset designed to evaluate AI's multi-discipline cognitive reasoning abilities across text and image modalities, revealing current limitations and guiding future superintelligent AI development.

Contribution

The paper introduces OlympicArena, a large-scale, multi-modal benchmark with interdisciplinary problems to evaluate and analyze AI reasoning capabilities comprehensively.

Findings

01. GPT-4o achieves only 39.97% accuracy on the benchmark.

02. Current AI models show significant limitations in complex, interdisciplinary reasoning.

03. OlympicArena provides resources for advancing AI research in scientific discovery.

Abstract

The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling…

Code & Models

Repositories

gair-nlp/olympicarena · Official

Datasets

GAIR/OlympicArena · dataset · 209 downloads

Videos

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI · SlidesLive

Taxonomy

Topics: AI-based Problem Solving and Planning

Methods: Sparse Evolutionary Training