What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao

TL;DR
This paper introduces MultiTempBench, a multilingual benchmark for temporal reasoning, and analyses how tokenisation and internal representations affect LLM performance across languages and calendar systems.
Contribution
It presents a new multilingual temporal reasoning benchmark and investigates the impact of tokenisation quality and internal representations on LLM temporal reasoning.
Findings
Tokenisation quality is a bottleneck in low-resource languages and rare calendar formats.
High-resource languages are more robust to digit-level token fragmentation.
Temporal linearity strongly predicts reasoning performance in high-resource languages.
Abstract
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains examples built by translating curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource…
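To make the idea of date fragmentation concrete, the sketch below scores how badly a tokenisation breaks the Year/Month/Day components of a date string. It is only an illustration under stated assumptions: the function names (`token_spans`, `fragmentation_score`) and the scoring rule are hypothetical, and this is not the paper's mDFR, whose exact definition and human-severity calibration are not given in this summary. The token lists are hand-written rather than produced by any real tokeniser.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each token in the joined text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def fragmentation_score(tokens: List[str], components: List[str]) -> float:
    """Hypothetical score: fraction of date components NOT matched by exactly one token.

    `components` lists the date parts (e.g. year, month, day) in the order
    they appear in the text. 0.0 means every component is a single token;
    1.0 means every component is split or merged with its neighbours.
    """
    text = "".join(tokens)
    spans = set(token_spans(tokens))
    intact, search_from = 0, 0
    for comp in components:
        start = text.index(comp, search_from)  # raises ValueError if absent
        end = start + len(comp)
        if (start, end) in spans:
            intact += 1
        search_from = end
    return 1 - intact / len(components)


# Hand-written tokenisations of "2024-03-15" (assumed, not from a real tokeniser).
clean = ["2024", "-", "03", "-", "15"]
fragmented = ["20", "24", "-0", "3", "-", "15"]

print(fragmentation_score(clean, ["2024", "03", "15"]))
print(fragmentation_score(fragmented, ["2024", "03", "15"]))
```

In the second tokenisation the year is split in two and the month is merged with its separator, so only the day survives as a single token. A real measurement would replace the hand-written lists with the output of the tokeniser under study.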
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Constraint Satisfaction and Optimization
