TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei; Wei Li; Feifan Song; Wen Luo; Tianyi Zhuang; Haochen Tan; Zhijiang Guo; Houfeng Wang

arXiv:2505.12891 · cs.AI · October 9, 2025

TL;DR

The paper introduces TIME, a comprehensive multi-level benchmark with over 38,000 QA pairs designed to evaluate and advance temporal reasoning in large language models within real-world scenarios, addressing key challenges like complex dependencies and dynamic events.

Contribution

It presents a new large-scale benchmark with diverse sub-tasks and datasets, along with analysis of model performance and the release of a human-annotated subset to promote future research.

Findings

01. Models show varying performance across different real-world scenarios.

02. Test-time scaling impacts temporal reasoning capabilities.

03. The benchmark reveals gaps in current models' temporal understanding.

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects the real-world challenges of temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning…

Code & Models

Repositories

sylvain-wei/time (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization