TL;DR
This paper introduces ProHist-Bench, a benchmark built on the Chinese Imperial Examination (Keju) system, to evaluate LLMs' capacity for complex historical reasoning; it reveals significant limitations in current models.
Contribution
The paper presents ProHist-Bench, an expert-curated benchmark for assessing LLMs' professional-level historical reasoning. It addresses a gap in existing benchmarks, which primarily test basic knowledge breadth or lexical understanding.
Findings
State-of-the-art LLMs perform poorly on complex historical questions.
ProHist-Bench includes 400 expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics.
The benchmark aims to guide development of more capable domain-specific LLMs.
Abstract
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant…
