Understanding Data Temporality Impact on Large Language Models Pre-training

Pilchen Hippolyte; Fabre Romain; Signe Talla Franck; Perez Patrick; Grave Edouard

arXiv:2605.22769·cs.CL·May 22, 2026

TL;DR

This paper investigates how data ordering during pre-training affects large language models' ability to acquire and maintain time-sensitive factual knowledge, highlighting the benefits of temporally ordered training.

Contribution

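The paper introduces a benchmark of over 7,000 temporally grounded questions, together with an evaluation protocol that checks whether a model associates each fact with the correct time period. It also reports that temporally ordered pre-training yields fresher, more temporally accurate factual knowledge than standard shuffled pre-training.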
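As a rough illustration of what a per-period evaluation might look like, the sketch below groups question-level results by time period and reports accuracy for each. The record layout `(period, is_correct)` and the function name `accuracy_by_period` are assumptions made for this example; they are not taken from the paper's protocol or from the KairosQA dataset schema.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Return accuracy per time period.

    `records` is an iterable of (period, is_correct) pairs. This layout is a
    hypothetical stand-in, not the benchmark's actual format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for period, is_correct in records:
        totals[period] += 1
        hits[period] += int(bool(is_correct))
    # Sort periods so the result reads chronologically.
    return {period: hits[period] / totals[period] for period in sorted(totals)}


# Toy data: two questions about 2020, one about 2023.
results = [(2020, True), (2020, False), (2023, True)]
per_period = accuracy_by_period(results)
```

Breaking accuracy down by period, rather than reporting a single number, is what makes it possible to see whether one model is stronger on recent facts while another is stronger on older ones.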

Findings

01. Temporally ordered models have more up-to-date knowledge.

02. Shuffled models perform better on older data.

03. Ordered training enhances factual freshness.

Abstract

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting…
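The contrast between the two data orderings can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Doc` type, the snapshot label format, and the choice to shuffle documents within each snapshot are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    snapshot: str  # hypothetical sortable snapshot label, e.g. "2019-04"
    text: str


def shuffled_order(docs, seed=0):
    """Standard baseline: one global shuffle that ignores snapshot dates."""
    rng = random.Random(seed)
    out = list(docs)
    rng.shuffle(out)
    return out


def temporal_order(docs, seed=0):
    """Oldest snapshot first; documents are shuffled only within a snapshot.

    The within-snapshot shuffle is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_snapshot = {}
    for doc in docs:
        by_snapshot.setdefault(doc.snapshot, []).append(doc)
    out = []
    for snapshot in sorted(by_snapshot):  # labels sort chronologically
        bucket = by_snapshot[snapshot]
        rng.shuffle(bucket)
        out.extend(bucket)
    return out


docs = [
    Doc("2021-10", "a"),
    Doc("2019-04", "b"),
    Doc("2020-05", "c"),
    Doc("2019-04", "d"),
]
ordered = temporal_order(docs)
baseline = shuffled_order(docs)
```

Both functions return the same documents; only the order in which the model would see them differs, which is the variable the paper isolates.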

Code & Models

Repositories

kyutai-labs/kairos
github

Models

kyutai/Sequential_Helium_6B
model · 297 dl

Datasets

kyutai/KairosQA
dataset · 118 dl