GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

TL;DR
GPQA is a challenging, expert-written multiple-choice science question dataset, designed to be difficult for both skilled non-experts and state-of-the-art AI, to support the development of scalable oversight methods for AI systems.
Contribution
The paper introduces GPQA, a high-quality, difficult science question dataset that is resistant to Google searches and challenging for state-of-the-art AI, facilitating research on AI oversight.
Findings
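Experts who have or are pursuing PhDs in the relevant domain achieve 65% accuracy on GPQA questions (74% when discounting clear mistakes the experts identified in retrospect).
Highly skilled non-expert validators reach only 34% accuracy, despite spending over 30 minutes on average with unrestricted web access.
The strongest GPT-4-based baseline achieves 39% accuracy on GPQA questions.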
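The accuracy figures above are fractions of multiple-choice questions answered correctly. As a minimal sketch of that metric (the `accuracy` helper, the answer key, and the responses below are hypothetical illustrations, not GPQA data or the authors' evaluation code):

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted choice matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical answer key and responses, for illustration only (not GPQA data).
answer_key = ["A", "C", "B", "D", "A"]
responses = ["A", "B", "B", "D", "C"]

score = accuracy(responses, answer_key)
print(f"accuracy = {score:.0%}")  # 3 of 5 correct
```

The same function could score experts, non-expert validators, or a model, provided each group's responses are aligned to the same answer key.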
Abstract
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their…
Code & Models
- 🤗 google/gemma-3-4b-it · model · 1.5M dl · ♡ 1272
- 🤗 google/gemma-3-27b-it · model · 1.0M dl · ♡ 1940
- 🤗 unsloth/gemma-3-12b-it-GGUF · model · 101k dl · ♡ 178
- 🤗 google/gemma-3-1b-it · model · 1.4M dl · ♡ 899
- 🤗 google/gemma-3-12b-it-qat-q4_0-gguf · model · 7.1k dl · ♡ 262
- 🤗 google/gemma-3-270m · model · 83k dl · ♡ 1003
- 🤗 google/gemma-3-12b-it · model · 2.6M dl · ♡ 698
- 🤗 google/gemma-3-12b-it-qat-q4_0-unquantized · model · 28k dl · ♡ 81
- 🤗 p-e-w/gemma-3-12b-it-heretic · model · 2.4k dl · ♡ 79
- 🤗 llmfan46/gemma-3-12b-it-ultra-uncensored-heretic-GGUF · model · 23k dl · ♡ 13
Videos
The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4 · YouTube
Taxonomy
Topics: Machine Learning and Data Classification · Software Engineering Research · Imbalanced Data Classification Techniques
Methods: Attention Is All You Need · Dense Connections · Dropout · Softmax · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Layer · Adam · Multi-Head Attention
