BIG-Bench Extra Hard

Mehran Kazemi; Bahare Fatemi; Hritik Bansal; John Palowitch; Chrysovalantis Anastasiou; Sanket Vaibhav Mehta; Lalit K. Jain; Virginia Aglietti; Disha Jindal; Peter Chen; Nishanth Dikkala; Gladys Tyen; Xin Liu; Uri Shalit; Silvia Chiappa; Kate Olszewska; Yi Tay; Vinh Q. Tran; Quoc V. Le; Orhan Firat

arXiv:2502.19187·cs.CL·May 7, 2025

TL;DR

BIG-Bench Extra Hard (BBEH) is a challenging new benchmark for evaluating the general reasoning abilities of large language models now that existing benchmarks have saturated; results on it show substantial room for improvement.

Contribution

The paper introduces BBEH, which replaces each task in BBH with a harder task probing a similar reasoning capability, to better assess LLM reasoning and expose current limitations.

Findings

01

The best general-purpose model achieves a (harmonic) average accuracy of only 9.8% on BBEH; the best reasoning-specialized model reaches 44.8%

02

Significant gap between general-purpose and reasoning-specialized models

03

Highlights ongoing challenges in robust reasoning for LLMs
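The paper's abstract describes its headline figure as a (harmonic) average of per-task accuracies. The sketch below illustrates, under stated assumptions, why that aggregate is stricter than a plain mean: a single weak task pulls the score down sharply. The task names and scores are hypothetical placeholders, not results from the paper, and the paper's exact aggregation details (for example, how zero scores are handled) may differ.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-task accuracies in percent (illustrative only,
# NOT numbers reported in the BBEH paper).
per_task_accuracy = {
    "task_a": 80.0,
    "task_b": 40.0,
    "task_c": 10.0,
}

scores = list(per_task_accuracy.values())

# The arithmetic mean treats all tasks equally in absolute terms.
arithmetic = mean(scores)

# The harmonic mean is dominated by the lowest scores, so a model
# must do reasonably well on every task to score well overall.
harmonic = harmonic_mean(scores)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

For any set of positive scores that are not all equal, the harmonic mean is strictly below the arithmetic mean, which is why it rewards uniformly strong performance across tasks.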

Abstract

Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and on its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce…

Code & Models

Repositories

google-deepmind/bbeh (Official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Focus