SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu, Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie, Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li,, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma

TL;DR
SuperGPQA is a new benchmark that evaluates LLMs across 285 specialized academic disciplines, revealing significant performance gaps and providing methodological insights for large-scale knowledge assessment.
Contribution
The paper introduces SuperGPQA, a comprehensive, multi-disciplinary benchmark with a novel collaborative filtering approach for question refinement, expanding evaluation scope beyond mainstream fields.
Findings
Current LLMs perform poorly across many disciplines.
DeepSeek-R1 achieved 61.82% accuracy on SuperGPQA.
Large-scale annotation involved over 80 experts.
Abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Law
