Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei

TL;DR
This paper evaluates challenging BIG-Bench tasks and demonstrates that chain-of-thought prompting significantly improves language models' performance, surpassing human-rater averages on many tasks, especially with larger models.
Contribution
It introduces the BIG-Bench Hard subset and shows that chain-of-thought prompting enhances model performance beyond previous few-shot methods, revealing emergent capabilities.
Findings
CoT prompting enables PaLM to surpass average human-rater performance on 10 of the 23 BBH tasks.
Larger models benefit more from CoT, showing emergent performance.
Few-shot prompting without CoT underestimates model capabilities.
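The difference between answer-only few-shot prompting and CoT prompting can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Exemplar` type, the helper names, the "Let's think step by step." lead-in, the "So the answer is X." closing convention, and the toy arithmetic exemplar are hypothetical and are not taken from the paper's prompt files.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    """One hand-written few-shot example (hypothetical structure)."""
    question: str
    rationale: str  # intermediate reasoning steps, used only by the CoT prompt
    answer: str


def build_answer_only_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Standard few-shot prompt: each exemplar shows only the final answer."""
    parts = [f"Q: {e.question}\nA: {e.answer}" for e in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def build_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """CoT prompt: each exemplar shows its reasoning before the final answer."""
    parts = [
        f"Q: {e.question}\n"
        f"A: Let's think step by step. {e.rationale} So the answer is {e.answer}."
        for e in exemplars
    ]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a completion that follows the assumed
    'So the answer is X.' convention; return None if the pattern is absent."""
    match = re.search(r"So the answer is (.+?)\.(?:\s|$)", completion)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Toy exemplar invented for this sketch; it is not a BBH task item.
    shots = [Exemplar("What is 2 + 3?", "2 plus 3 equals 5.", "5")]
    print(build_answer_only_prompt(shots, "What is 4 + 4?"))
    print()
    print(build_cot_prompt(shots, "What is 4 + 4?"))
```

Both prompts use the same exemplars and the same test question. Only the CoT version asks the model to produce reasoning first, which is why the final answer has to be parsed out of the completion afterwards.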
Abstract
BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex…
Code & Models
- 🤗 deeplang-ai/LingoWhale-8B (model) · 147 dl · ♡ 21
- 🤗 akjindal53244/Llama-3.1-Storm-8B (model) · 2.2k dl · ♡ 177
- 🤗 akjindal53244/Llama-3.1-Storm-8B-FP8-Dynamic (model) · 9 dl · ♡ 14
- 🤗 akjindal53244/Llama-3.1-Storm-8B-GGUF (model) · 237 dl · ♡ 41
- 🤗 RichardErkhov/akjindal53244_-_Llama-3.1-Storm-8B-gguf (model) · 102 dl · ♡ 2
- 🤗 QuantFactory/Llama-3.1-Storm-8B-GGUF (model) · 37 dl · ♡ 2
- 🤗 unsloth/Llama-3.1-Storm-8B (model) · 14 dl · ♡ 3
- 🤗 unsloth/Llama-3.1-Storm-8B-bnb-4bit (model) · 17 dl · ♡ 7
- 🤗 EpistemeAI2/FireStorm-Llama-3.1-8B (model) · 6 dl · ♡ 2
- 🤗 QuantFactory/FireStorm-Llama-3.1-8B-GGUF (model) · 20 dl · ♡ 2
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
Methods: Pathways Language Model
