BIG-Bench Extra Hard
Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch,, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia, Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu,, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran

TL;DR
BIG-Bench Extra Hard (BBEH) is a new challenging benchmark designed to evaluate and push the limits of large language models' general reasoning abilities beyond existing benchmarks, revealing significant room for improvement.
Contribution
The paper introduces BBEH, a more difficult benchmark replacing tasks in BBH to better assess LLM reasoning capabilities and identify current limitations.
Findings
Best models achieve only 9.8% accuracy on BBEH
Significant gap between general-purpose and reasoning-specialized models
Highlights ongoing challenges in robust reasoning for LLMs
Abstract
Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
MethodsSparse Evolutionary Training · Focus
