Tracing Multilingual Factual Knowledge Acquisition in Pretraining

Yihong Liu; Mingyang Wang; Amir Hossein Kargaran; Felicia Körner; Ercong Nie; Barbara Plank; François Yvon; Hinrich Schütze

arXiv:2505.14824 · cs.CL · October 8, 2025

TL;DR

This paper investigates how multilingual factual knowledge and crosslingual consistency develop during the pretraining of large language models, revealing the roles of fact frequency and transfer effects in knowledge acquisition.

Contribution

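The paper provides the first detailed analysis of how factual recall and crosslingual consistency evolve over the course of pretraining, using OLMo-7B as a case study. It identifies fact frequency in the pretraining corpus as the main driver of recall and crosslingual transfer from English as a pathway for some low-frequency facts.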

Findings

01 Accuracy and consistency improve over pretraining time for most languages.

02 Fact frequency in the corpus strongly influences recall accuracy.

03 Crosslingual transfer benefits low-frequency facts, especially early in pretraining.
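
The quantities named in the findings above (accuracy, consistency, fact frequency) can be illustrated with a minimal sketch. This is not the paper's implementation: the Jaccard-style consistency measure, the frequency threshold, the function names, and the toy fact IDs are all assumptions made for illustration, and the paper may define these metrics differently.

```python
from typing import Dict, Set


def recall_accuracy(correct: Set[str], facts: Set[str]) -> float:
    """Fraction of the given facts that were recalled correctly."""
    return len(correct & facts) / len(facts) if facts else 0.0


def pairwise_consistency(correct_a: Set[str], correct_b: Set[str]) -> float:
    """Assumed metric: Jaccard overlap of the facts two languages get right."""
    union = correct_a | correct_b
    return len(correct_a & correct_b) / len(union) if union else 0.0


def accuracy_by_frequency(
    correct: Set[str], freq: Dict[str, int], threshold: int
) -> Dict[str, float]:
    """Split facts into low/high frequency buckets and score each bucket."""
    low = {fact for fact, count in freq.items() if count < threshold}
    high = set(freq) - low
    return {
        "low": recall_accuracy(correct, low),
        "high": recall_accuracy(correct, high),
    }


# Hypothetical toy data: corpus counts per fact and the facts answered
# correctly in two languages at a single pretraining checkpoint.
freq = {"f1": 500, "f2": 3, "f3": 120, "f4": 1}
correct_en = {"f1", "f3", "f4"}
correct_de = {"f1", "f4"}

print(recall_accuracy(correct_en, set(freq)))
print(pairwise_consistency(correct_en, correct_de))
print(accuracy_by_frequency(correct_en, freq, threshold=10))
```

Repeating this computation at each saved checkpoint would give the accuracy and consistency curves over pretraining time. A low-frequency fact that is answered correctly in a non-English language (here `f4` in `correct_de`) is the kind of case the third finding attributes to crosslingual transfer.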

Abstract

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an…

Code & Models

Repositories

cisnlp/multilingual-fact-tracing (PyTorch, official)

Videos

Tracing Multilingual Factual Knowledge Acquisition in Pretraining · underline

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification