SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

Zhongjian Miao; Hao Fu; Chen Wei

arXiv:2511.09993 · cs.AI · January 12, 2026

TL;DR

This paper introduces SPAN, a benchmark for evaluating large language models' ability to perform cross-calendar temporal reasoning. It shows that current LLMs reach only 34.5% average accuracy on the task and proposes a tool-augmented Time Agent that raises accuracy to 95.31%.

Contribution

The paper presents SPAN, a novel benchmark for cross-calendar temporal reasoning, and develops a tool-augmented Time Agent that substantially enhances LLMs' reasoning accuracy.

Findings

01 LLMs achieve only 34.5% average accuracy on SPAN, with none exceeding 80%.

02 The tool-augmented Time Agent improves accuracy to 95.31%.

03 The analysis identifies Future-Date Degradation and Calendar Asymmetry Bias as key challenges.

Abstract

We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we…
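The template-driven protocol described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the template wording, the `make_instance` helper, and the choice of the Julian calendar as the target (the abstract does not name the six calendars). The sketch takes a user-specified Gregorian date, applies an intra-calendar step (adding a day offset), and then performs an inter-calendar conversion through the Julian Day Number.

```python
from datetime import date, timedelta

# Hypothetical question template; the benchmark's real templates are not shown in the abstract.
TEMPLATE = (
    "Today is {today} in the Gregorian calendar. "
    "What is the Julian-calendar date {offset} days from today?"
)


def gregorian_to_jdn(d: date) -> int:
    """Julian Day Number (at noon) of a proleptic Gregorian date."""
    # date.toordinal() counts days from 0001-01-01 = 1; the constant shifts to the JDN epoch.
    return d.toordinal() + 1721425


def jdn_to_julian(jdn: int) -> tuple[int, int, int]:
    """Convert a Julian Day Number to a (year, month, day) Julian-calendar date.

    Uses the integer algorithm attributed to Richards, in its Julian-calendar form.
    """
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def make_instance(today: date, offset: int) -> dict:
    """Build one question/answer pair anchored at a user-specified Gregorian date."""
    target = today + timedelta(days=offset)                   # intra-calendar reasoning step
    y, m, d = jdn_to_julian(gregorian_to_jdn(target))         # inter-calendar conversion step
    return {
        "question": TEMPLATE.format(today=today.isoformat(), offset=offset),
        "answer": f"{y:04d}-{m:02d}-{d:02d}",
    }


if __name__ == "__main__":
    instance = make_instance(date(2025, 1, 1), 13)
    print(instance["question"])
    print(instance["answer"])
```

Because the gold answer is computed from the date supplied at evaluation time, the same template yields a different instance for each date, which is the property the abstract relies on for time-variant, contamination-free evaluation.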

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques