\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang; Yang Liu; Jiaqi Li; Jun Bai; Hao Chen; Kaiyuan Chen; Tiliang Duan; Jiayun Dong; Xiaobo Hu; Zixia Jia; Yang Liu; Tao Peng; Yixin Ren; Ran Tian; Zaiyuan Wang; Yanglihong Xiao; Gang Yao; Lingyue Yin; Ge Zhang; Chun Zhang; Jianpeng Jiao; Zilong Zheng; and Yuan Gong

arXiv:2603.07980·cs.LG·March 10, 2026

TL;DR

This paper introduces OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science that evaluates language agents' professional reasoning, source retrieval, and decision-making.

Contribution

It presents a new benchmark that emphasizes real-world professional tasks requiring complex reasoning, source validation, and domain-specific rules, filling gaps in existing evaluation methods.

Findings

01. Benchmark covers Law, Finance, Healthcare, and more.

02. Evaluates factual accuracy, coherence, and practical feasibility.

03. Provides a unified platform for assessing professional language model performance.

Abstract

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constraint decisions, where correctness depends as much on the reasoning process as the final answer. We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on…
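The abstract is truncated here, so the exact scoring procedure is not visible. As a rough illustration of how a rubric-based protocol over the four named dimensions might aggregate scores, here is a minimal Python sketch. Everything beyond the four dimension names is an assumption made for illustration: the `RubricItem` structure, the point weights, the binary satisfied/unsatisfied judgment, and the points-earned-over-points-possible aggregation are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# The four dimensions named in the abstract.
DIMENSIONS = (
    "factual_accuracy",
    "logical_coherence",
    "practical_feasibility",
    "professional_compliance",
)


@dataclass(frozen=True)
class RubricItem:
    """One checkable criterion (hypothetical structure, not from the paper)."""
    dimension: str
    description: str
    points: float  # hypothetical weight assigned by the rubric author


def score_response(rubric, satisfied):
    """Return (per-dimension fractions, overall fraction) of rubric points earned.

    `satisfied` is the set of rubric indices that a grader judged as met.
    The aggregation rule is an assumption, not the paper's protocol.
    """
    earned = {d: 0.0 for d in DIMENSIONS}
    possible = {d: 0.0 for d in DIMENSIONS}
    for idx, item in enumerate(rubric):
        if item.dimension not in possible:
            raise ValueError(f"unknown dimension: {item.dimension}")
        possible[item.dimension] += item.points
        if idx in satisfied:
            earned[item.dimension] += item.points
    # Only report dimensions that the rubric actually covers.
    per_dimension = {
        d: earned[d] / possible[d] for d in DIMENSIONS if possible[d] > 0
    }
    total_possible = sum(possible.values())
    overall = sum(earned.values()) / total_possible if total_possible else 0.0
    return per_dimension, overall


# Made-up rubric for a made-up legal task, purely to show the call shape.
example_rubric = [
    RubricItem("factual_accuracy", "Cites the governing statute", 2.0),
    RubricItem("logical_coherence", "Conclusion follows from cited facts", 1.0),
    RubricItem("practical_feasibility", "Proposed action fits the deadline", 1.0),
    RubricItem("professional_compliance", "Flags the conflict of interest", 1.0),
]
per_dimension, overall = score_response(example_rubric, satisfied={0, 2})
```

Keeping per-dimension fractions separate from the overall fraction lets a reader see whether a response failed on facts or on compliance, which a single aggregate number would hide.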

Code & Models

Datasets

humanlaya-data-lab/OneMillion-Bench
dataset · 563 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Topic Modeling