SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
Zhongjian Miao, Hao Fu, Chen Wei

TL;DR
This paper introduces SPAN, a benchmark for evaluating how well large language models perform cross-calendar temporal reasoning. Evaluated models average only 34.5% accuracy, and a proposed tool-augmented Time Agent raises accuracy to 95.31%.
Contribution
The paper presents SPAN, a novel benchmark for cross-calendar temporal reasoning, and develops a tool-augmented Time Agent that substantially enhances LLMs' reasoning accuracy.
Findings
LLMs achieve only 34.5% average accuracy on SPAN.
Time Agent improves accuracy to 95.31%.
The paper identifies Future-Date Degradation and Calendar Asymmetry Bias as key challenges.
Abstract
We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we…
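The template-driven protocol described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the template wording, the function names, and the choice of the Julian calendar as the target (the text above does not list the six calendars). The sketch fills a template from a user-specified Gregorian date, performs the intra-calendar step (adding days), and then the inter-calendar conversion via the Julian Day Number.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Hypothetical template; the benchmark's actual wording is not given in the text.
TEMPLATE = (
    "Today is {today} in the Gregorian calendar. "
    "What is the date {n} days from today in the Julian calendar?"
)


def gregorian_to_jdn(d: date) -> int:
    """Julian Day Number of a (proleptic) Gregorian date."""
    # date.toordinal() counts days from 0001-01-01 = 1; JDN of that day is 1721426.
    return d.toordinal() + 1721425


def jdn_to_julian(jdn: int) -> Tuple[int, int, int]:
    """Convert a Julian Day Number to a Julian-calendar (year, month, day)."""
    # Standard integer-arithmetic conversion (Richards' algorithm, Julian variant).
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def make_instance(anchor: date, offset_days: int) -> Dict[str, str]:
    """Build one question/answer pair for a user-specified Gregorian date."""
    # Intra-calendar reasoning: shift the anchor date within the Gregorian calendar.
    target = anchor + timedelta(days=offset_days)
    # Inter-calendar conversion: Gregorian -> JDN -> Julian calendar.
    y, m, d = jdn_to_julian(gregorian_to_jdn(target))
    return {
        "question": TEMPLATE.format(today=anchor.isoformat(), n=offset_days),
        "answer": f"{y:04d}-{m:02d}-{d:02d}",
    }


if __name__ == "__main__":
    print(make_instance(date(1999, 12, 31), 1))
```

Because the answer is computed from the anchor date at generation time, the same template yields a different instance for every date, which is one plausible way to obtain the time-variant, contamination-free evaluation the abstract mentions.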
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
