On Path to Multimodal Historical Reasoning: HistBench and HistAgent

Jiahao Qiu; Fulian Xiao; Yimin Wang; Yuchen Mao; Yijia Chen; Xinzhe Juan; Shu Zhang; Siran Wang; Xuan Qi; Tongcheng Zhang; Zixin Yao; Jiacheng Guo; Yifu Lu; Charles Argon; Jundi Cui; Daixin Chen; Junran Zhou; Shuyao Zhou; Zhanpeng Zhou; Ling Yang; Shilong Liu; Hongru Wang; Kaixuan Huang; Xun Jiang; Yuming Cao; Yue Chen; Yunfei Chen; Zhengyi Chen; Ruowei Dai; Mengqiu Deng; Jiye Fu; Yunting Gu; Zijie Guan; Zirui Huang; Xiaoyan Ji; Yumeng Jiang; Delong Kong; Haolong Li; Jiaqi Li; Ruipeng Li; Tianze Li; Zhuoran Li; Haixia Lian; Mengyue Lin; Xudong Liu; Jiayi Lu; Jinghan Lu; Wanyu Luo; Ziyue Luo; Zihao Pu; Zhi Qiao; Ruihuan Ren; Liang Wan; Ruixiang Wang; Tianhui Wang; Yang Wang; Zeyu Wang; Zihua Wang; Yujia Wu; Zhaoyi Wu; Hao Xin; Weiao Xing; Ruojun Xiong; Weijie Xu; Yao Shu; Yao Xiao; Xiaorui Yang; Yuchen Yang; Nan Yi; Jiadong Yu; Yangyuxuan Yu; Huiting Zeng; Danni Zhang; Yunjie Zhang; Zhaoyu Zhang; Zhiheng Zhang; Xiaofeng Zheng; Peirong Zhou; Linyan Zhong; Xiaoyin Zong; Ying Zhao; Zhenxin Chen; Lin Ding; Xiaoyu Gao; Bingbing Gong; Yichao Li; Yang Liao; Guang Ma; Tianyuan Ma; Xinrui Sun; Tianyi Wang; Han Xia; Ruobing Xian; Gen Ye; Tengfei Yu; Wentao Zhang; Yuxi Wang; Xi Gao; Mengdi Wang

arXiv:2505.20246·cs.AI·June 23, 2025



TL;DR

This paper introduces HistBench, a comprehensive benchmark for evaluating AI's historical reasoning across multiple modalities and languages, and presents HistAgent, a specialized agent that significantly outperforms generalist models on this benchmark.

Contribution

The paper creates a new challenging benchmark, HistBench, for historical reasoning and develops HistAgent, a domain-specific AI agent with tools tailored for historical analysis.

Findings

01

LLMs perform poorly on HistBench tasks.

02

HistAgent outperforms generalist models significantly.

03

Tools like OCR, translation, and image understanding improve historical reasoning.

Abstract

Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This paper introduces HistBench, which is very useful for assessing agents' abilities to solve complex historical questions through historical reasoning. This could be beneficial to the history research community.
- The proposed agent, HistAgent, achieves SOTA performance on multiple benchmarks and could be a valuable resource for historians.
- The authors performed a comprehensive evaluation and analysis, providing valuable insights such as highlighting the importance of tools.

Weaknesses

- The authors did not discuss existing work on developing agents that are good at solving historical questions or at historical reasoning. It would be good if they could provide a literature review of what others in the field have developed.
- The authors used different base language models (e.g., Claude and GPT-4o) across different benchmarks without explaining why. It would be good if the authors could include a brief description of the reason for their choice of models.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. The paper introduces a nuanced benchmark on historical reasoning, which is an under-explored area.
2. The manual verification and difficulty annotation process is very thorough. The authors use a 3-step verification process and create a comprehensive six-question rubric for annotation.
3. The dataset is diverse, with 29 languages spanning several regions and decades.

Weaknesses

1. Misleading results are reported in Figure 1 and the abstract. According to Figure 1, HistAgent performs better than base models. However, Table 3 shows that o3 and o4-mini achieve much higher accuracies than HistAgent.
2. While HistBench is a novel dataset compared to previous works, it is still very small, with only 414 questions. Additionally, the data creation pipeline is time-consuming and not very scalable.
3. The better performance of HistAgent on HistBench makes sense sin

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Originality: The paper fills an important gap by introducing a domain-specific benchmark for the humanities, focusing on multimodal and multilingual historical reasoning. While earlier efforts like HLE or HiST-LLM covered historical knowledge, none provided this level of depth or tool-grounded evaluation.
Quality: The benchmark design is robust, involving domain experts, stratified difficulty levels, and rigorous three-stage quality control (screening, LLM difficulty filtering, and expert revie

Weaknesses

1. The dataset size (414 items) limits statistical granularity across 29 languages, especially for low-resource ones. Scaling beyond the pilot phase would strengthen claims of coverage.
2. The results could include more fine-grained analysis, such as performance by language, modality, or reasoning dimension.
3. The architecture section is somewhat heavy on engineering detail but could benefit from clearer ablation results demonstrating which tools contribute most.

Code & Models

Repositories

Datasets

jiahaoq/HistBench
dataset· 228 dl


Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling