TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

Zheng Chu; Jingchang Chen; Qianglong Chen; Weijiang Yu; Haotian Wang; Ming Liu; Bing Qin

arXiv:2311.17667·cs.CL·July 1, 2024·2 cites

TL;DR

TimeBench is a comprehensive benchmark designed to evaluate the temporal reasoning abilities of large language models across various phenomena, revealing significant gaps compared to human performance and highlighting challenges in the field.

Contribution

This paper introduces TimeBench, the first extensive hierarchical benchmark for temporal reasoning in large language models, covering diverse reasoning categories and providing a platform for future research.

Findings

01. LLMs lag behind humans in temporal reasoning tasks.
02. Performance varies across different reasoning categories.
03. Multiple factors influence LLMs' temporal reasoning capabilities.

Abstract

Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different…

Code & Models

Repositories

zchuz/timebench (Official)

Videos

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Absolute Position Encodings