Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time
David Herel, Vojtech Bartek, Jiri Jirak, Tomas Mikolov

TL;DR
This paper introduces a new benchmark and evaluation framework to assess large language models' ability to recall facts accurately across different points in time, highlighting current limitations in temporal reasoning.
Contribution
We present a novel dataset and evaluation method for temporal fact recall, revealing challenges in LLMs' temporal consistency and performance across different model types.
Findings
Base models outperform instruction-tuned models on time-sensitive recall.
Large models show brittleness with paraphrased facts.
Temporal reasoning remains a significant challenge for LLMs.
Abstract
Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By…
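As a rough illustration of what probing a model for time-sensitive recall could look like, the sketch below prefixes one event with each candidate year, asks a scoring function for a score per prompt, and checks where the true year ranks. This is not the paper's TimeShift implementation: the `Event` record, the "In {year}, ..." prompt template, the `score` callable (imagined here as a model log-likelihood), and the top-k metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence


@dataclass(frozen=True)
class Event:
    """Hypothetical record for one dated event (field names are assumptions)."""
    text: str    # event description, written without an explicit year
    day: date    # day-level timestamp of the event
    domain: str  # e.g. "politics", "science", "business"


def rank_of_true_year(event: Event, years: Sequence[int],
                      score: Callable[[str], float]) -> int:
    """Return the 1-based rank of the event's true year among candidate years.

    `score` is a placeholder for any function that maps a prompt to a number
    where higher means more plausible (for example a summed log-probability).
    The event's true year must be present in `years`.
    """
    # Assumed prompt template: prefix the same event text with each year.
    ranked = sorted(years, key=lambda y: score(f"In {y}, {event.text}"),
                    reverse=True)
    return ranked.index(event.day.year) + 1


def top_k_accuracy(events: Sequence[Event], years: Sequence[int],
                   score: Callable[[str], float], k: int = 1) -> float:
    """Fraction of events whose true year is among the k best-scored years."""
    hits = sum(rank_of_true_year(e, years, score) <= k for e in events)
    return hits / len(events)
```

In practice `score` would wrap a real model; any deterministic stand-in is enough to exercise the ranking logic. Comparing the ranks obtained for an event and for a reworded version of the same event would be one simple way to look at the paraphrase brittleness the abstract mentions.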
Taxonomy
Topics: Topic Modeling · Semantic Web and Ontologies · Advanced Text Analysis Techniques
Methods: Balanced Selection · ALIGN
