On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Shu Zhang, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang

TL;DR
This paper introduces HistBench, a comprehensive benchmark for evaluating AI's historical reasoning across multiple modalities and languages, and presents HistAgent, a specialized agent that significantly outperforms generalist models on this benchmark.
Contribution
The paper creates a new challenging benchmark, HistBench, for historical reasoning and develops HistAgent, a domain-specific AI agent with tools tailored for historical analysis.
Findings
LLMs perform poorly on HistBench tasks.
HistAgent outperforms generalist models significantly.
Tools like OCR, translation, and image understanding improve historical reasoning.
Abstract
Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving…
Peer Reviews
Decision · Submitted to ICLR 2026
- This paper introduces HistBench, which is very useful for assessing agents' abilities to solve complex historical questions through historical reasoning. This could benefit the history research community.
- The proposed agent, HistAgent, achieves SOTA performance on multiple benchmarks and could be a valuable resource for historians.
- The authors performed a comprehensive evaluation and analysis, providing valuable insights such as the importance of tools.
- The authors did not discuss existing work on agents for historical question answering or historical reasoning. A literature review of what others in the field have developed would help.
- The authors used different base language models (e.g. Claude and GPT-4o) across different benchmarks without explaining why. A brief justification for the choice of models would help.
1. The paper introduces a nuanced benchmark on historical reasoning, which is an under-explored area.
2. The manual verification and difficulty annotation process is very thorough. The authors use a 3-step verification process and create a comprehensive six-question rubric for annotation.
3. The dataset is diverse, with 29 languages spanning several regions and decades.
1. Misleading results are reported in Figure 1 and the abstract. According to Figure 1, HistAgent performs better than base models. However, Table 3 shows that o3 and o4-mini achieve much higher accuracies than HistAgent.
2. While HistBench is a novel dataset compared to previous work, it is still very small, with only 414 questions. Additionally, the data creation pipeline is time-consuming and not very scalable.
3. The better performance of HistAgent on HistBench makes sense sin
Originality: The paper fills an important gap by introducing a domain-specific benchmark for the humanities, focusing on multimodal and multilingual historical reasoning. While earlier efforts like HLE or HiST-LLM covered historical knowledge, none provided this level of depth or tool-grounded evaluation.
Quality: The benchmark design is robust, involving domain experts, stratified difficulty levels, and rigorous three-stage quality control (screening, LLM difficulty filtering, and expert revie
1. The dataset size (414 items) limits statistical granularity across 29 languages, especially for low-resource ones. Scaling beyond the pilot phase would strengthen claims of coverage.
2. The results could include more fine-grained analysis, such as performance by language, modality, or reasoning dimension.
3. The architecture section is somewhat heavy on engineering detail but could benefit from clearer ablation results demonstrating which tools contribute most.
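The reviewer's concern about statistical granularity can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes the 414 questions are split evenly across the 29 languages (the source does not give the actual distribution), uses a hypothetical 50% accuracy, and applies the standard Wilson score interval, which is not a method taken from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


TOTAL_QUESTIONS = 414  # stated in the abstract
NUM_LANGUAGES = 29     # stated in the reviews

# Assumption: an even split, which gives about 14 questions per language.
avg_per_language = TOTAL_QUESTIONS / NUM_LANGUAGES

# Hypothetical scores: 7 of 14 correct in one language, 207 of 414 overall.
low_small, high_small = wilson_interval(7, 14)
low_full, high_full = wilson_interval(207, 414)

print(f"average questions per language: {avg_per_language:.1f}")
print(f"per-language 95% interval: {low_small:.2f} to {high_small:.2f}")
print(f"full-benchmark 95% interval: {low_full:.2f} to {high_full:.2f}")
```

Under these assumptions, the per-language interval spans roughly 0.27 to 0.73, against roughly 0.45 to 0.55 for the full benchmark, which is why per-language comparisons on about 14 items are hard to interpret.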
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
