TRAM: Benchmarking Temporal Reasoning for Large Language Models
Yuqing Wang, Yun Zhao

TL;DR
TRAM is a comprehensive benchmark consisting of ten datasets designed to evaluate the temporal reasoning abilities of large language models across various temporal aspects, highlighting current limitations and guiding future improvements.
Contribution
This paper introduces TRAM, the first standardized, multi-faceted benchmark for evaluating temporal reasoning in large language models, covering diverse temporal tasks.
Findings
Current LLMs lag behind human performance in temporal reasoning
GPT-4 and Llama2 show limited capabilities in zero-shot and few-shot settings
Baseline models perform significantly worse than state-of-the-art models
Abstract
Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope, characterized by a lack of standardized benchmarks that would allow for consistent evaluations across different studies. In this paper, we introduce TRAM, a temporal reasoning (TeR) benchmark composed of ten datasets, encompassing various temporal aspects of events such as order, arithmetic, frequency, and duration, designed to facilitate a comprehensive evaluation of the TeR capabilities of large language models (LLMs). We evaluate popular LLMs like GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish baselines with BERT-based and domain-specific models. Our findings indicate that the best-performing model lags significantly behind human performance. It is our aspiration that TRAM will spur further progress in…
Peer Reviews
Decision·Submitted to ICLR 2024
* Detailed description of dataset creation, sources, templates, and prompts used.
* Insightful error analysis that investigates each specific error type at the task level.
* Results on several LLMs, including GPT-4/3.5, Llama2, and PaLM 2.
* There are many models designed specifically for temporal reasoning, but none of them are included in the benchmark. Without them, it is difficult to compare results between LLMs and RoBERTa or BERT. What advantages do LLMs bring, and on which tasks, compared to these specialized models, which are smaller than LLMs? [1] Yuan, Weizhe, and Pengfei Liu. "reStructured Pre-training." (2022) [2] Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, and Dan Roth. (2021). "Te
1. The authors introduce a new dataset and benchmark for evaluating the temporal reasoning capabilities of large language models, with a sufficient amount of data across different time-related domains, including duration, frequency, ordering, etc.
2. The authors provide a detailed description of the format of the benchmark dataset.
3. The authors provide a comprehensive experimental evaluation of popular LLMs, including GPT-4, GPT-3.5, and Llama2, on the TRAM benchmark.
4. The authors provide
1. As the paper primarily focuses on datasets and benchmarks for large language models, it would be better to provide an anonymous repository (for example, via https://anonymous.4open.science/) with code for dataset curation and empirical evaluation, as well as simple documentation on running the LLM assessments.
2. At this point, the overall contribution of the dataset curation done by the authors is unclear. It would be better to provide some examples for comparing the differences between
• A comprehensive benchmark covering various temporal reasoning abilities: ordering, frequency, duration, typical time, ambiguity, arithmetic, relation, temporal NLI, causality, storytelling.
• The dataset is large, with 526,068 problems for benchmarking.
• Both the pretraining-finetuning and prompting paradigms of LLMs are evaluated on the benchmark, yielding conclusions about their reasoning abilities. It is a good starting point from which the community can evolve the LLM techniques
• The benchmark is currently only in the form of multiple-choice questions.
• The sizes of the different problem categories are imbalanced. For example, causality has only 600 problems. This might make the benchmark's evaluation results misleading, especially for the pretraining-finetuning paradigm.
• The texts are mostly from existing datasets. The latest LLMs might have seen them in crawled data during the pretraining phase. It might make the benchmarking results over-estimate the performa
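The imbalance concern in the review above can be illustrated with a small sketch. It is not the authors' evaluation code: the function name `accuracy_report`, the record layout, and the toy counts are all assumptions made for illustration. It contrasts overall (micro) accuracy, which is dominated by large categories, with the unweighted mean of per-task accuracies (macro), which gives a small category such as causality equal weight.

```python
from collections import defaultdict


def accuracy_report(records):
    """Compute per-task, micro, and macro accuracy.

    `records` is an iterable of (task, predicted_choice, gold_choice)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)

    per_task = {task: correct[task] / total[task] for task in total}
    # Micro: every question counts equally, so large tasks dominate.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: every task counts equally, regardless of its size.
    macro = sum(per_task.values()) / len(per_task)
    return per_task, micro, macro


# Toy, made-up data: a large task answered well, a tiny task answered poorly.
records = (
    [("ordering", "A", "A")] * 90
    + [("ordering", "B", "A")] * 10
    + [("causality", "A", "A")] * 1
    + [("causality", "B", "A")] * 1
)

per_task, micro, macro = accuracy_report(records)
print(per_task)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

On this toy input the micro average stays close to the large task's accuracy while the macro average drops noticeably, which is why reporting both (or per-task scores) can be a reasonable safeguard when category sizes differ widely.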
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization
