LTLBench: Towards Benchmarks for Evaluating Temporal Reasoning in Large Language Models
Weizhi Tang, Kwabena Nuamah, Vaishak Belle

TL;DR
This paper introduces LTLBench, a new benchmark dataset of 2000 challenges based on Linear Temporal Logic, to evaluate and analyze the temporal reasoning abilities of large language models.
Contribution
It presents a novel approach using LTL to synthesize challenges, creates a comprehensive dataset, and benchmarks multiple LLMs to understand their temporal reasoning capabilities.
Findings
LLMs show varied performance on LTL-based challenges.
Increasing the number of formula operators and events affects LLM reasoning and performance in unexpected ways.
Qualitative analysis reveals key issues in LLM temporal reasoning processes.
Abstract
Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. To study the TR ability in LLMs, prior works provide different ways for evaluating various aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR ability by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct a dataset, namely LTLBench, consisting of TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate the impact of increasing the number of formula operators and events on both LLM performance and the complexity of TR problems. We also perform qualitative analyses of their reasoning processes and the effects of…
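To make the idea of an LTL-based challenge concrete, the sketch below checks whether a formula holds on a sequence of events. It is an illustration only and not the authors' pipeline: the tuple encoding of formulas, the `holds` function, and the choice of finite-trace semantics (standard LTL is defined over infinite traces) are all assumptions introduced here.

```python
from typing import List, Set, Tuple, Union

# Hypothetical encoding (not from the paper): an atom is a string, and a
# compound formula is a tuple such as ("U", "a", "b") or ("G", ("or", "a", "b")).
Formula = Union[str, Tuple]
Trace = List[Set[str]]  # the set of events that hold at each time step


def holds(formula: Formula, trace: Trace, i: int = 0) -> bool:
    """Evaluate an LTL formula at position i of a finite trace."""
    if isinstance(formula, str):  # atomic event
        return i < len(trace) and formula in trace[i]
    op, *args = formula
    if op == "not":
        return not holds(args[0], trace, i)
    if op == "and":
        return holds(args[0], trace, i) and holds(args[1], trace, i)
    if op == "or":
        return holds(args[0], trace, i) or holds(args[1], trace, i)
    if op == "X":  # next (strong): a successor step must exist
        return i + 1 < len(trace) and holds(args[0], trace, i + 1)
    if op == "F":  # eventually: holds at some step from i onward
        return any(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "G":  # always: holds at every step from i onward
        return all(holds(args[0], trace, j) for j in range(i, len(trace)))
    if op == "U":  # until: right side holds at some j, left side holds before j
        return any(
            holds(args[1], trace, j)
            and all(holds(args[0], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op!r}")


# Example trace: event "a" holds at steps 0 and 1, event "b" at step 2.
trace = [{"a"}, {"a"}, {"b"}]
a_until_b = holds(("U", "a", "b"), trace)
always_a = holds(("G", "a"), trace)
```

On this trace, "a until b" is satisfied while "always a" is not. A challenge in this style would ask a model to reach the same verdict from a natural-language description of the events and the formula.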
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Logic, Reasoning, and Knowledge
