Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi; Mehran Kazemi; Anton Tsitsulin; Karishma Malkan; Jinyeong Yim; John Palowitch; Sungyong Seo; Jonathan Halcrow; Bryan Perozzi

arXiv:2406.09170 · cs.CL · June 14, 2024 · 2 cites

TL;DR

This paper introduces synthetic datasets and an evaluation framework to systematically assess large language models' temporal reasoning abilities, addressing limitations of real-world data and providing insights into model performance on complex temporal tasks.

Contribution

The work presents novel synthetic datasets and an open-source evaluation framework specifically designed for testing LLMs on temporal reasoning, enabling controlled and comprehensive analysis.

Findings

01. Identified strengths and weaknesses of LLMs in temporal reasoning
02. Analyzed the impact of question structure and size on performance
03. Provided datasets for future benchmarking and research

Abstract

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable…
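
To make the idea of a controlled synthetic benchmark concrete, the following is a minimal sketch, not the authors' generator. It assumes anonymized entity and relation names (`E0`, `R1`, ...), year-granularity intervals, a single sentence template, and one question type; all of these, along with the helper names `generate_facts`, `render_prompt`, and `holders_at`, are hypothetical choices made for illustration. It shows how the gold answer can be computed from the underlying facts, so that fact order can be varied in the prompt without changing the label.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One temporal edge: subject held `relation` with `obj` from start to end (years, inclusive)."""
    subject: str
    relation: str
    obj: str
    start: int
    end: int


def generate_facts(num_entities, num_facts, seed=0):
    """Sample random temporal facts over anonymized entities (illustrative scheme only)."""
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(num_entities)]
    relations = [f"R{i}" for i in range(3)]
    facts = []
    for _ in range(num_facts):
        subject, obj = rng.sample(entities, 2)  # two distinct entities
        start = rng.randint(1950, 2000)
        end = start + rng.randint(0, 20)
        facts.append(Fact(subject, rng.choice(relations), obj, start, end))
    return facts


def render_prompt(facts, question, shuffle_seed=None):
    """Render each fact as one sentence; optionally shuffle to vary fact order."""
    ordered = list(facts)
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(ordered)
    lines = [
        f"{f.subject} was the {f.relation} of {f.obj} from {f.start} to {f.end}."
        for f in ordered
    ]
    return "\n".join(lines + [question])


def holders_at(facts, relation, obj, year):
    """Gold answer: every subject whose interval for (relation, obj) covers `year`."""
    return sorted(
        {
            f.subject
            for f in facts
            if f.relation == relation and f.obj == obj and f.start <= year <= f.end
        }
    )


if __name__ == "__main__":
    facts = generate_facts(num_entities=5, num_facts=8, seed=1)
    target = facts[0]
    question = f"Which entities were the {target.relation} of {target.obj} in {target.start}?"
    print(render_prompt(facts, question, shuffle_seed=42))
    print("Gold answer:", holders_at(facts, target.relation, target.obj, target.start))
```

Because the label is derived from the fact set rather than from the rendered text, the same question can be posed under different fact orderings or graph sizes while the expected answer stays fixed.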

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- For the ToT-Semantic dataset, designed to evaluate LLMs on temporal semantics and logic, the authors employ seven graph generation algorithms and develop eight manually crafted question types. This diversity allows the generation of a large volume of synthetic questions, adding rigor to the dataset and covering various temporal reasoning facets.
- The study provides detailed insights into the temporal reasoning capabilities of frontier LLMs, including how factors such as graph size, question…

Weaknesses

- While ToT-Semantic focuses on temporal semantics and logical reasoning, the paper does not clearly explain how the graph generation process ensures the correctness of graph evolution. Specifically, the distinction between generating static graphs and those with temporal dynamics is not addressed, leaving questions about the dataset's fidelity to real-world temporal processes.
- In the introduction, the paper emphasizes the importance of evaluating LLMs on temporal reasoning but does not clearly…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The proposed ToT benchmark is designed to address the limitations of existing benchmarks by encompassing a wider variety of graph structures and question types, enabling a more nuanced evaluation of LLMs' temporal reasoning abilities.
- The authors offer an evaluation of temporal reasoning by decoupling it into semantic and arithmetic aspects. This two-pronged approach provides a more detailed analysis of LLM capabilities.

Weaknesses

- As mentioned in the limitations section, the benchmark focuses on scenarios where both the start and end times of a fact are mentioned within a single sentence, but real-world temporal information can be spread across multiple sentences or documents.
- The authors generate questions using templates, which might not fully capture the complexity and variability of natural language found in real-world temporal reasoning tasks.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The data synthesis process benefits from the graph-guided control, and could be generalized to many other tasks.
- The constructed data are comprehensive and include many perspectives with quality control.
- Experiments are extensively conducted on multiple aspects, and provide some insights on future directions.

Weaknesses

- Some claims lack quantitative evidence:
  - “real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies” Could you add some quantitative evidence showing that the GPT-4 or Gemini-1.5 Pro baselines have pre-training data contamination?
  - “LLMs could even potentially guess the original entities due to their adjacent relations” This also lacks quantitative evidence. If this is a commonsense…

Code & Models

Datasets

baharef/ToT
dataset · 62 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies