OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He; Renjie Luo; Yuzhuo Bai; Shengding Hu; Zhen Leng Thai; Junhao Shen; Jinyi Hu; Xu Han; Yujie Huang; Yuxiang Zhang; Jie Liu; Lei Qi; Zhiyuan Liu; Maosong Sun

arXiv:2402.14008·cs.CL·June 7, 2024·1 cites

TL;DR

OlympiadBench is a challenging bilingual multimodal benchmark with 8,476 Olympiad-level problems designed to evaluate and advance the reasoning abilities of large AI models, especially in scientific domains like mathematics and physics.

Contribution

This work introduces OlympiadBench, a new rigorous benchmark with expert annotations, to better assess AI models' advanced reasoning and problem-solving skills in scientific competitions.

Findings

01

GPT-4V scores 17.97% on OlympiadBench

02

Physics problems are particularly challenging, with an average score of only 10.74%

03

Common issues include hallucinations, knowledge omissions, and logical fallacies

Abstract

Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97%…

Code & Models

Repositories

openbmb/olympiadbench (Official)

Videos

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems · underline

Taxonomy

Topics: Educational Assessment and Pedagogy · Educational Strategies and Epistemologies · Second Language Acquisition and Learning