What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia; Ahmad Muhammad Isa; Maxime Peyrard; Wei Zhao

arXiv:2603.19017·cs.CL·March 20, 2026

TL;DR

This paper introduces MultiTempBench, a multilingual benchmark for temporal reasoning, and analyses how tokenisation and internal representations affect LLM performance across languages and calendar systems.

Contribution

It presents a new multilingual temporal reasoning benchmark and investigates the impact of tokenisation quality and internal representations on LLM temporal reasoning.

Findings

01. Tokenisation quality is a bottleneck in low-resource languages and rare calendar formats.

02. High-resource languages are more robust to digit-level token fragmentation.

03. Temporal linearity strongly predicts reasoning performance in high-resource languages.
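To make the notion of fragmentation concrete, the sketch below checks whether a tokenisation keeps the year, month, and day of a date string in separate, intact pieces. It is an illustration only: it is not the paper's mDFR, the function names are hypothetical, and the two token splits are made up by hand rather than produced by any real tokeniser.

```python
def token_spans(tokens):
    """Return (start, end) character offsets for each token, in order."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def component_spans(text, components):
    """Locate each date component (year, month, day) left to right in text."""
    spans, pos = [], 0
    for comp in components:
        start = text.index(comp, pos)
        spans.append((start, start + len(comp)))
        pos = start + len(comp)
    return spans


def fragmentation_report(text, tokens, components):
    """For each component, count overlapping tokens and flag boundary leaks.

    'clean' is True only if every token touching the component lies fully
    inside it, i.e. no token straddles a component boundary.
    Assumes components are given in year, month, day order (hypothetical).
    """
    assert "".join(tokens) == text, "tokens must concatenate to the text"
    t_spans = token_spans(tokens)
    report = {}
    names = ("year", "month", "day")
    for name, (cs, ce) in zip(names, component_spans(text, components)):
        overlapping = [(ts, te) for ts, te in t_spans if ts < ce and te > cs]
        clean = all(cs <= ts and te <= ce for ts, te in overlapping)
        report[name] = {"pieces": len(overlapping), "clean": clean}
    return report


date_text = "2024-03-06"
parts = ["2024", "03", "06"]

# Hand-written token splits (assumptions, not real tokeniser output).
aligned = ["2024", "-", "03", "-", "06"]
fragmented = ["20", "24-", "0", "3-0", "6"]

aligned_report = fragmentation_report(date_text, aligned, parts)
fragmented_report = fragmentation_report(date_text, fragmented, parts)
```

In the aligned split every component is a single intact token, while in the fragmented split each component is spread over two tokens that also leak across the separators, which is the kind of lost Year/Month/Day separation the findings refer to.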

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource…
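As a rough illustration of what a date-arithmetic item with controlled date-format variants might look like, the sketch below computes one answer and renders it in several surface forms. The specific formats, function names, and example date are assumptions chosen for illustration; the abstract does not list the benchmark's actual variants, and only the Gregorian calendar is covered here.

```python
from datetime import date, timedelta


def add_days(start: date, days: int) -> date:
    """Gregorian date arithmetic: shift a date by a number of days."""
    return start + timedelta(days=days)


def format_variants(d: date) -> dict[str, str]:
    """Render one date in several surface formats (hypothetical set)."""
    return {
        "iso": d.isoformat(),                                # YYYY-MM-DD
        "dmy_slash": f"{d.day:02d}/{d.month:02d}/{d.year}",  # DD/MM/YYYY
        "mdy_slash": f"{d.month:02d}/{d.day:02d}/{d.year}",  # MM/DD/YYYY
        "compact": f"{d.year}{d.month:02d}{d.day:02d}",      # YYYYMMDD
    }


# Example question: "What is the date 15 days after 20 February 2024?"
start = date(2024, 2, 20)
answer = add_days(start, 15)  # 2024 is a leap year, so February has 29 days
variants = format_variants(answer)

for name, text in variants.items():
    print(f"{name}: {text}")
```

The underlying answer is the same in every variant; only the surface string changes, which is what lets the effect of formatting and tokenisation be separated from the reasoning itself.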

Code & Models

Datasets

gagan3012/MultiTempBench
dataset · 35 downloads

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Constraint Satisfaction and Optimization