Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, and Bryan Perozzi

TL;DR
This paper introduces synthetic datasets and an evaluation framework to systematically assess large language models' temporal reasoning abilities, addressing limitations of real-world data and providing insights into model performance on complex temporal tasks.
Contribution
The work presents novel synthetic datasets and an open-source evaluation framework specifically designed for testing LLMs on temporal reasoning, enabling controlled and comprehensive analysis.
Findings
Identified strengths and weaknesses of LLMs in temporal reasoning
Analyzed the impact of question structure and size on performance
Provided datasets for future benchmarking and research
Abstract
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable…
Peer Reviews
Decision: ICLR 2025 Poster
- For the ToT-Semantic dataset, designed to evaluate LLMs on temporal semantics and logic, the authors employ seven graph generation algorithms and develop eight manually crafted question types. This diversity allows the generation of a large volume of synthetic questions, adding rigor to the dataset and covering various temporal reasoning facets.
- The study provides detailed insights into the temporal reasoning capabilities of frontier LLMs, including how factors such as graph size, question
- While ToT-Semantic focuses on temporal semantics and logical reasoning, the paper does not clearly explain how the graph generation process ensures the correctness of graph evolution. Specifically, the distinction between generating static graphs and those with temporal dynamics is not addressed, leaving questions about the dataset's fidelity to real-world temporal processes.
- In the introduction, the paper emphasizes the importance of evaluating LLMs on temporal reasoning but does not clearly
- The proposed ToT benchmark is designed to address the limitations of existing benchmarks by encompassing a wider variety of graph structures and question types, enabling a more nuanced evaluation of LLMs' temporal reasoning abilities.
- The authors offer an evaluation of temporal reasoning by decoupling it into semantic and arithmetic aspects. This two-pronged approach provides a more detailed analysis of LLM capabilities.
- As mentioned in the limitation section, the benchmark focuses on scenarios where both the start and end times of a fact are mentioned within a single sentence. But real-world temporal information can be spread across multiple sentences or documents.
- The authors generate questions using templates, which might not fully capture the complexity and variability of natural language found in real-world temporal reasoning tasks.
- The data synthesis process benefits from the graph-guided control, and could be generalized to many other tasks.
- The constructed data are comprehensive and include many perspectives with quality control.
- Experiments are extensively conducted on multiple aspects, and provide some insights on future directions.
- Some claims lack quantitative evidence:
  - “real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies” Could you add some quantitative evidence showing that the GPT-4 or Gemini-1.5 Pro baselines have pre-training data contamination?
  - “LLMs could even potentially guess the original entities due to their adjacent relations” This also lacks quantitative evidence. If this is a commonsense
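The reviews above describe questions that are generated from graphs through templates, with each fact's start and end time stated in a single sentence. A minimal Python sketch of that general idea follows. It assumes a simple random directed graph; the generator choice, the names `Fact`, `generate_facts`, `fact_to_sentence`, and `make_question`, the relation labels, and the sentence templates are all hypothetical illustrations, not the benchmark's actual code or data format.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One temporal edge: subject --relation--> obj, valid from start to end."""
    subject: str
    relation: str
    obj: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


def generate_facts(num_entities, edge_prob, rng,
                   relations=("R1", "R2"), year_range=(1900, 2000)):
    """Sample a random directed graph and attach a validity interval to each edge.

    Assumption: each ordered entity pair is kept independently with
    probability edge_prob. The source does not say which generators are used.
    """
    entities = [f"E{i}" for i in range(num_entities)]
    facts = []
    for s in entities:
        for o in entities:
            if s != o and rng.random() < edge_prob:
                start = rng.randint(year_range[0], year_range[1])
                end = rng.randint(start, year_range[1])
                facts.append(Fact(s, rng.choice(relations), o, start, end))
    return facts


def fact_to_sentence(fact):
    """Render one fact as a single sentence holding both start and end time."""
    return (f"{fact.subject} was the {fact.relation} of {fact.obj} "
            f"from {fact.start} to {fact.end}.")


def make_question(facts, rng):
    """Build one templated point-in-time question and compute its answer set."""
    if not facts:
        raise ValueError("cannot build a question from an empty fact list")
    target = rng.choice(facts)
    year = rng.randint(target.start, target.end)
    # Every subject whose matching fact covers the sampled year is a valid answer.
    answers = sorted({
        f.subject for f in facts
        if f.relation == target.relation
        and f.obj == target.obj
        and f.start <= f.end and f.start <= year <= f.end
    })
    question = f"Which entity was the {target.relation} of {target.obj} in {year}?"
    return question, answers


if __name__ == "__main__":
    rng = random.Random(0)
    facts = generate_facts(num_entities=5, edge_prob=0.5, rng=rng)
    for fact in facts:
        print(fact_to_sentence(fact))
    if facts:
        question, answers = make_question(facts, rng)
        print(question, answers)
```

Because the entities are synthetic and the answer is computed from the graph, a setup like this lets graph size and fact order be varied independently, for example by changing `num_entities` or shuffling `facts` before rendering the prompt.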
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
