Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

Soyeon Kim; Jindong Wang; Xing Xie; Steven Euijong Whang

arXiv:2508.02045 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper introduces TDBench, a scalable benchmark for evaluating large language models' ability to handle time-sensitive factual questions using temporal databases, with a new metric for assessing time reference validity.

Contribution

The paper presents TDBench, a novel benchmark leveraging temporal database techniques for scalable, comprehensive TSQA evaluation, and introduces a new time accuracy metric for detailed assessment.

Findings

01 TDBench enables scalable TSQA evaluation on application-specific data.

02 Temporal database techniques improve the construction of time-sensitive question-answer pairs.

03 The new time accuracy metric provides a more detailed evaluation of model explanations.

Abstract

Facts change over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. Although factual Time-Sensitive Question-Answering (TSQA) tasks have been widely developed, existing benchmarks often face manual bottlenecks that limit scalable and comprehensive TSQA evaluation. To address this issue, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques, such as temporal functional dependencies, temporal SQL, and temporal joins. We also introduce a new evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy for a more fine-grained TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation…
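
To make the idea in the abstract concrete, here is a minimal sketch of how a temporal join could yield a time-sensitive question-answer pair, and how an answer might be scored on both answer accuracy and time accuracy. It is not TDBench's implementation. The tables, the fictional rows, the question template, the helper names `build_qa_pairs` and `score`, and the rule "every year cited in the explanation must fall inside the gold validity interval" are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical toy schema and rows, invented for illustration only.
# Each fact carries a closed validity interval [start_year, end_year].
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leader (country TEXT, name TEXT, start_year INTEGER, end_year INTEGER);
CREATE TABLE host (event TEXT, country TEXT, start_year INTEGER, end_year INTEGER);
INSERT INTO leader VALUES ('Avalon', 'Person X', 2001, 2004),
                          ('Avalon', 'Person Y', 2005, 2010);
INSERT INTO host VALUES ('Games 1', 'Avalon', 2004, 2004),
                        ('Games 2', 'Avalon', 2008, 2008);
""")

# Temporal join: rows match on the key AND their validity intervals overlap.
# The result keeps the intersection of the two intervals.
TEMPORAL_JOIN = """
SELECT h.event, l.name,
       MAX(h.start_year, l.start_year) AS valid_from,
       MIN(h.end_year, l.end_year) AS valid_to
FROM host AS h JOIN leader AS l ON h.country = l.country
WHERE h.start_year <= l.end_year AND l.start_year <= h.end_year
ORDER BY h.event, valid_from
"""

def build_qa_pairs(conn):
    """Turn each joined row into a question, a gold answer, and a gold interval."""
    pairs = []
    for event, name, valid_from, valid_to in conn.execute(TEMPORAL_JOIN):
        pairs.append({
            "question": f"Who led the country hosting {event} when it was held?",
            "answer": name,
            "valid": (valid_from, valid_to),
        })
    return pairs

# Four-digit years only; an assumed, deliberately crude time-reference extractor.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def score(pair, answer, explanation):
    """Return (answer_ok, time_ok) under the assumed scoring rule."""
    answer_ok = pair["answer"].lower() in answer.lower()
    years = [int(y) for y in YEAR.findall(explanation)]
    start, end = pair["valid"]
    # Assumed rule: at least one year is cited and all cited years are valid.
    time_ok = bool(years) and all(start <= y <= end for y in years)
    return answer_ok, time_ok

pairs = build_qa_pairs(conn)
print(pairs[0]["question"])
print(score(pairs[0], "Person X", "Person X was in office in 2004."))
print(score(pairs[0], "Person X", "Person X was in office in 2008."))
```

The second call gives the right answer with a wrong time reference, which is the case a separate time-oriented score is meant to catch: answer accuracy alone would count it as fully correct.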

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems