Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Lirong Gao; Zeqing Wang; Yuyan Cai; Jiayi Deng; Yanmei Gu; Yiming Zhang; Jia Zhou; Yanfei Zhang; Junbo Zhao

arXiv:2604.24690 · cs.CL · April 28, 2026

TL;DR

This paper introduces ProHist-Bench, a new benchmark based on the Chinese Imperial Examination (Keju) that evaluates how well LLMs perform complex historical reasoning, and finds significant limitations in current models.

Contribution

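The paper presents ProHist-Bench, a comprehensive, expert-curated benchmark for assessing LLMs' historical research skills. It addresses a gap left by existing benchmarks, which mainly test breadth of basic knowledge or lexical understanding rather than higher-order skills such as evidentiary reasoning.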

Findings

01

State-of-the-art LLMs perform poorly on complex historical questions.

02

ProHist-Bench includes 400 questions across eight dynasties with detailed rubrics.

03

The benchmark aims to guide development of more capable domain-specific LLMs.

Abstract

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant…

Code & Models

Repositories

inclusionAI/ABench/tree/main/ProHist-Bench
github
