Measuring Iterative Temporal Reasoning with Time Puzzles
Zhengxiang Wang, Zeyu Dong

TL;DR
This paper introduces Time Puzzles, a benchmark for evaluating iterative temporal reasoning in large language models (LLMs) with and without tools such as web search. It shows that even the strongest model tested falls well short of reliable accuracy.
Contribution
It presents an algorithmically generated, constraint-based date inference benchmark that combines factual temporal anchors with cross-cultural calendar relations, enabling controlled and continual evaluation of tool-using temporal reasoning.
Findings
GPT-5, the best of the 13 LLMs evaluated, achieves only 55.3% accuracy without tools.
Web search improves model performance.
Rewriting constraints with explicit dates improves performance substantially more than web search does.
Abstract
Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the…
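To make the task format concrete, here is a minimal sketch of constraint-based date inference. It is not the authors' generator or evaluation code. The puzzle, the function names (`candidate_dates`, `solve`), and the search window are hypothetical and chosen only for illustration. The sketch enumerates candidate dates and keeps those that satisfy every constraint, so a puzzle may yield one or several valid dates.

```python
from datetime import date, timedelta


def candidate_dates(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def solve(constraints, start, end):
    """Return all dates in [start, end] that satisfy every constraint."""
    return [d for d in candidate_dates(start, end)
            if all(check(d) for check in constraints)]


# Hypothetical puzzle (not taken from the benchmark):
#   1. The year is the year of the Apollo 11 Moon landing (a factual anchor).
#   2. The month is July.
#   3. The day falls on a Sunday.
# Resolving the anchor to an explicit year is the step a model would need
# recall or web search for; here it is hard-coded as an assumption-free fact.
APOLLO_11_YEAR = 1969

constraints = [
    lambda d: d.year == APOLLO_11_YEAR,
    lambda d: d.month == 7,
    lambda d: d.weekday() == 6,  # Monday is 0, so Sunday is 6
]

solutions = solve(constraints, date(1960, 1, 1), date(1979, 12, 31))
for d in solutions:
    print(d.isoformat())
```

This puzzle admits four valid dates, the four Sundays of July 1969. Replacing the first constraint's wording with "the year is 1969" is an example of what rewriting a constraint with an explicit date could look like: the calendar reasoning is unchanged, but the fact lookup is no longer needed.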
Taxonomy
Topics: Constraint Satisfaction and Optimization · Topic Modeling · AI-based Problem Solving and Planning
