Hubble: a Model Suite to Advance the Study of LLM Memorization

Johnny Tian-Zheng Wei; Ameya Godbole; Mohammad Aflah Khan; Ryan Wang; Xiaoyuan Zhu; James Flemings; Nitya Kashyap; Krishna P. Gummadi; Willie Neiswanger; Robin Jia

arXiv:2510.19811·cs.CL·October 23, 2025

TL;DR

Hubble introduces a suite of open-source LLMs designed to study memorization, revealing how data frequency and training order influence memorization risks, and providing tools for further research and mitigation strategies.

Contribution

Hubble provides the first comprehensive open-source models and experimental framework for analyzing and mitigating LLM memorization risks.

Findings

01

Memorization risk is determined by how often sensitive data appears relative to the size of the training corpus.

02

Sensitive data inserted at a controlled point in pretraining can be forgotten when the model is not exposed to it again.

03

Ordering data earlier in training reduces memorization.

Abstract

We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing…
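The password example in the abstract comes down to simple dilution arithmetic, which the following minimal sketch illustrates. It is not the authors' code and does not measure memorization; it only computes the share of training tokens that an inserted passage occupies. The function names and the 32-token passage length are hypothetical; the 100B and 500B corpus sizes are the ones quoted in the abstract.

```python
from fractions import Fraction

# Corpus sizes quoted in the abstract (in tokens).
CORPUS_TOKENS = {"100B": 100 * 10**9, "500B": 500 * 10**9}

# Hypothetical length of an inserted passage, chosen only for illustration.
PASSAGE_TOKENS = 32


def relative_frequency(copies: int, passage_tokens: int, corpus_tokens: int) -> Fraction:
    """Exact share of the training tokens occupied by `copies` of a passage."""
    return Fraction(copies * passage_tokens, corpus_tokens)


def dilution_ratio(copies: int, passage_tokens: int = PASSAGE_TOKENS) -> Fraction:
    """How much larger the passage's share is in the 100B corpus than in the 500B one."""
    small = relative_frequency(copies, passage_tokens, CORPUS_TOKENS["100B"])
    large = relative_frequency(copies, passage_tokens, CORPUS_TOKENS["500B"])
    return small / large


if __name__ == "__main__":
    for name, size in CORPUS_TOKENS.items():
        share = relative_frequency(1, PASSAGE_TOKENS, size)
        print(f"{name} corpus: one copy occupies {float(share):.2e} of the tokens")
    print(f"dilution ratio: {dilution_ratio(1)}")
```

Under these assumptions, a single copy takes up five times the share of the 100B-token corpus that it takes up in the 500B-token corpus, so five copies would be needed in the larger corpus to match the relative frequency of one copy in the smaller one. Exact fractions are used so that the ratio is not affected by floating-point rounding.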

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 5

Strengths

The strengths of the paper are listed as follows.
1. HUBBLE will be fully open-sourced upon publication, which is important for the researchers to reproduce its results and leverage the suite for their own study in LLM memorization.
2. The design of HUBBLE is reasonable and useful. The pair of standard models and perturbed models enables a fair and clean comparison when studying the mechanism of LLM memorization.
3. The open-source data and checkpoints also make the quantitative analysis of LLM

Weaknesses

The major concern with HUBBLE is that it only covers LLMs with 1B and 8B parameters. Therefore, some observations from HUBBLE may not extend to larger LLMs, which usually have a higher memorization rate and are more widely adopted in commercial applications. However, for academic research on LLM memorization, the two model sizes are sufficient.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The strengths of this work are listed as follows:
- This work would help any following work overcome one of the primary drawbacks in this field, where experiments had to be either painfully small scale or had to rely on somewhat unknown data patterns of models trained without a focus on memorization studies
- The predictable and standard nature of the perturbations added to the training data would help any further study conduct clear A/B tests on this study
- I really like the detailed nat

Weaknesses

- This is a (computationally) expensive study and ideally more patterns could/should have been embedded into the training data to make the most out of it. One interesting direction might have been understanding the relation between the distance between paraphrased samples and the degree to which they're memorized
- Test set insertions focus on benchmarks like MMLU and HellaSwag, whereas contamination in real corpora can be indirect or paraphrastic.
- The work isolates memorization metrics from g

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Highly useful model suite**: The Hubble model suite is highly useful for memorization research, and hence serves as a more contemporary test bed compared to, say, Pythia. For one, the 8 core models allow studying the causal effects of adding certain types of data in a broad range of domains. The authors choose domains and datasets well, and they demonstrate that the models are useful to measure things beyond pure verbatim memorization. The model ablations ("Runs" on p. 4) are also noteworthy.

Weaknesses

While this is an overall high-quality paper, there is one potential major issue related to decontamination between the base training corpus and perturbations. I am happy to raise my score if those issues are addressed. **Potential contamination between base corpus and perturbations**: The decontamination step of Hubble's data pipeline focuses on high precision (i.e., only discarding samples that clearly overlap the base corpus and perturbations). However, false negatives (i.e., contamination th

Code & Models

Datasets

allegrolab/dclm-baseline-500b_toks
dataset · 600 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling