Understanding Data Temporality Impact on Large Language Models Pre-training
Pilchen Hippolyte, Fabre Romain, Signe Talla Franck, Perez Patrick, Grave Edouard

TL;DR
This paper investigates how data ordering during pre-training affects large language models' ability to acquire and maintain time-sensitive factual knowledge, highlighting the benefits of temporally ordered training.
Contribution
It introduces a new benchmark and evaluation protocol for temporal factual knowledge, and demonstrates that temporally ordered pre-training improves model freshness and temporal accuracy.
Findings
Models pre-trained on temporally ordered data have more up-to-date knowledge.
Models pre-trained on shuffled data perform better on older data.
Ordered training enhances factual freshness.
Abstract
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting…
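The contrast between the two data orderings described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Doc` record, the "YYYY-MM" snapshot labels, and the choice to shuffle documents within each snapshot are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical record: a snapshot label in ISO "YYYY-MM" form plus the text.
    snapshot: str
    text: str


def shuffled_order(docs, seed=0):
    """Baseline ordering: mix documents from all snapshots at random."""
    rng = random.Random(seed)
    out = list(docs)
    rng.shuffle(out)
    return out


def temporal_order(docs, seed=0):
    """Visit snapshots from oldest to newest.

    Assumption: documents are shuffled within a snapshot, so only the
    order of the snapshots themselves is chronological.
    """
    rng = random.Random(seed)
    by_snapshot = {}
    for doc in docs:
        by_snapshot.setdefault(doc.snapshot, []).append(doc)
    out = []
    for snapshot in sorted(by_snapshot):  # ISO labels sort chronologically
        group = by_snapshot[snapshot]
        rng.shuffle(group)
        out.extend(group)
    return out


# Toy corpus with made-up snapshot labels.
corpus = [
    Doc("2021-04", "a"),
    Doc("2019-09", "b"),
    Doc("2023-06", "c"),
    Doc("2019-09", "d"),
    Doc("2021-04", "e"),
]

ordered = temporal_order(corpus)
mixed = shuffled_order(corpus)
print([d.snapshot for d in ordered])
# ['2019-09', '2019-09', '2021-04', '2021-04', '2023-06']
```

Under this sketch, a model trained on `ordered` sees its most recent snapshot last, whereas a model trained on `mixed` sees every period throughout training. That difference in exposure is what the paper compares.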
