Measuring Iterative Temporal Reasoning with Time Puzzles

Zhengxiang Wang; Zeyu Dong

arXiv:2601.07148 · cs.CL · March 24, 2026

Open Access

TL;DR

This paper introduces Time Puzzles, a benchmark for evaluating iterative temporal reasoning in large language models that can use tools such as web search. Even the best of the 13 models tested, GPT-5, reaches only 55.3% accuracy without tools.

Contribution

It presents an algorithmically generated, constraint-based date inference benchmark that evaluates how well models reason iteratively about time when tools such as web search are available.

Findings

01 GPT-5, the best of 13 LLMs evaluated, achieves only 55.3% accuracy without tools.

02 Web search improves model performance.

03 Models perform substantially better when constraints are rewritten with explicit dates.

Abstract

Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the…
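To make the task format concrete, here is a minimal sketch of what a constraint-based date puzzle could look like. It is illustrative only and is not the paper's generator or evaluation code: the `solve` helper, the `Constraint` type, and the example puzzle (anchored on the Apollo 11 Moon landing) are all hypothetical. It shows how a factual anchor combines with calendar relations, and how a puzzle can admit several valid dates.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List

# A constraint is any predicate over a candidate date (hypothetical type alias).
Constraint = Callable[[date], bool]


def solve(constraints: Iterable[Constraint], start: date, end: date) -> List[date]:
    """Return every date in [start, end] that satisfies all constraints."""
    checks = list(constraints)
    solutions = []
    day = start
    while day <= end:
        if all(check(day) for check in checks):
            solutions.append(day)
        day += timedelta(days=1)
    return solutions


# Factual temporal anchor: a model would have to recall or look up this date.
# Here it is already written out explicitly (the Apollo 11 landing, July 20, 1969).
anchor = date(1969, 7, 20)

# Hypothetical puzzle: "a Friday in the month after the Moon landing, same year".
puzzle = [
    lambda d: d.year == anchor.year,        # same year as the anchor event
    lambda d: d.month == anchor.month + 1,  # the month following the anchor
    lambda d: d.weekday() == 4,             # Monday is 0, so 4 is Friday
]

answers = solve(puzzle, date(1969, 1, 1), date(1969, 12, 31))
print(answers)  # several dates satisfy the puzzle, not just one
```

In this sketch, looking up the anchor date plays the role of the step that the explicit-date rewriting removes: once the anchor is a literal date, only the calendar reasoning remains.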

Taxonomy

Topics: Constraint Satisfaction and Optimization · Topic Modeling · AI-based Problem Solving and Planning