HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai; Zhihai Wang; Jinghang Wang; Boyu Yang; Xiaogang Li; Xander Xu; Bohan Wang; Peng Wang; Xingzhe Wu; Anfeng Li; Qiyuan Feng; Yuhao Zhou; Shoulin Han; Wenjie Luo; Yiyuan Li; Yaxuan Wang; Ruixian Luo; Guojie Lin; Peiyao Xiao; Chengliang Xu; Ben Wang; Zeyu Wang; Zichao Chen; Jianan Ye; Yijie Hu; Jialong Chen; Zongwen Shen; Yuliang Xu; An Yang; Bowen Yu; Dayiheng Liu; Junyang Lin; Hu Wei; Que Shen; and Bing Zhao

arXiv:2602.13964 · cs.CL · March 2, 2026

TL;DR

HLE-Verified is a meticulously verified and revised version of the Humanity's Last Exam benchmark, reducing noise and errors to enable more accurate evaluation of large language models across multiple domains.

Contribution

We developed a transparent, two-stage validation and revision process to create a certified, cleaner benchmark for evaluating language models, addressing issues in the original HLE.

Findings

01. Model accuracy improved by 7-10 percentage points on HLE-Verified.

02. Significant accuracy gains (30-40%) on items with original errors.

03. Reduced annotation noise leads to more faithful model capability measurement.

Abstract

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual…

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Academic integrity and plagiarism