Pretraining Exposure Explains Popularity Judgments in Large Language Models

Jamshid Mozafari; Bhawna Piryani; Adam Jatowt

arXiv:2605.12382 · cs.CL · May 13, 2026

TL;DR

This study finds that large language models' popularity judgments track pretraining exposure more closely than external popularity signals such as Wikipedia pageviews. The analysis uses the open OLMo models and their fully observable pretraining corpus, Dolma.

Contribution

The paper provides the first direct, large-scale analysis linking pretraining data exposure to LLM popularity judgments, and its results support exposure as a key factor behind those judgments.

Findings

01. Pretraining exposure correlates strongly with Wikipedia popularity.

02. LLM popularity judgments align more closely with pretraining exposure than with external signals.

03. The influence of exposure persists in the long tail of entities.

Abstract

Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure…
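
The comparison described in the abstract (exposure counts versus pageviews versus elicited LLM judgments) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the excerpt does not say which agreement statistics the authors use, so Spearman rank correlation and a simple pairwise agreement rate are stand-ins, the function names are hypothetical, the five data points are made up, and the pairwise "model" is simulated from scalar scores rather than by querying OLMo.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values ascending (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def pairwise_agreement(reference, model_prefers_first):
    """Share of entity pairs where the model's pick matches the reference ordering.

    Pairs that are tied in the reference signal are skipped.
    """
    hits = 0
    counted = 0
    for i, j in combinations(range(len(reference)), 2):
        if reference[i] == reference[j]:
            continue
        counted += 1
        if model_prefers_first(i, j) == (reference[i] > reference[j]):
            hits += 1
    return hits / counted


# Made-up toy values for five hypothetical entities (NOT data from the paper).
exposure = [9000, 1200, 300, 45, 7]       # mention counts in the pretraining corpus
pageviews = [50000, 700, 9000, 120, 30]   # Wikipedia pageviews
llm_scores = [95, 70, 40, 20, 5]          # elicited scalar popularity estimates


def simulated_preference(i, j):
    """Stand-in for a pairwise LLM query: prefer the entity with the higher scalar score."""
    return llm_scores[i] > llm_scores[j]


rho_exposure = spearman(exposure, llm_scores)
rho_pageviews = spearman(pageviews, llm_scores)
agree_exposure = pairwise_agreement(exposure, simulated_preference)
agree_pageviews = pairwise_agreement(pageviews, simulated_preference)
```

With real data, `exposure` would come from entity mention counts in the pretraining corpus and `simulated_preference` would be replaced by actual pairwise prompts to the model. Comparing `rho_exposure` with `rho_pageviews`, and `agree_exposure` with `agree_pageviews`, is one way to ask which signal the model's judgments follow more closely.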
