Hubble: a Model Suite to Advance the Study of LLM Memorization
Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia

TL;DR
Hubble introduces a suite of open-source LLMs designed to study memorization, revealing how data frequency and training order influence memorization risks, and providing tools for further research and mitigation strategies.
Contribution
Hubble provides the first comprehensive open-source models and experimental framework for analyzing and mitigating LLM memorization risks.
Findings
Memorization risk depends on how often sensitive data appears relative to the size of the training corpus.
Sensitive data inserted only in earlier pretraining phases can be forgotten without continued exposure.
Ordering sensitive data earlier in training reduces memorization.
Abstract
We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing…
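The dilution effect described in the abstract can be made concrete with a little arithmetic. The sketch below is only an illustration, not the paper's memorization metric: it computes the share of training tokens taken up by one inserted string in the 100B-token and 500B-token corpora. The 10-token length of the string and the helper name `relative_frequency` are assumptions made for this example.

```python
from fractions import Fraction

# Corpus sizes of the core release, in tokens.
CORPUS_SIZES = {"100B": 100_000_000_000, "500B": 500_000_000_000}


def relative_frequency(num_copies: int, tokens_per_copy: int, corpus_tokens: int) -> Fraction:
    """Share of all training tokens occupied by the inserted text."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return Fraction(num_copies * tokens_per_copy, corpus_tokens)


# Hypothetical inserted secret: a single copy, 10 tokens long (assumed values).
freq_small = relative_frequency(1, 10, CORPUS_SIZES["100B"])
freq_large = relative_frequency(1, 10, CORPUS_SIZES["500B"])

# How much more diluted the same string is in the larger corpus.
dilution = freq_small / freq_large

print(f"share of 100B corpus: {float(freq_small):.1e}")
print(f"share of 500B corpus: {float(freq_large):.1e}")
print(f"dilution factor: {dilution}")
```

Under these assumptions the same single insertion occupies a five times smaller share of the 500B-token corpus, which is the sense in which a larger corpus dilutes sensitive data. How that share translates into measured memorization is an empirical question that the paper's experiments address.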
Peer Reviews
Decision · ICLR 2026 Oral
The strengths of the paper are listed as follows.
1. HUBBLE will be fully open-sourced upon publication, which is important for researchers to reproduce its results and leverage the suite for their own studies of LLM memorization.
2. The design of HUBBLE is reasonable and useful. The pair of standard and perturbed models enables a fair and clean comparison when studying the mechanism of LLM memorization.
3. The open-source data and checkpoints also make the quantitative analysis of LLM
The major concern with HUBBLE is that it only covers LLMs with 1B and 8B parameters. Therefore, some observations from HUBBLE may not extend to larger LLMs, which usually have a higher memorization rate and are more widely adopted in commercial applications. However, for academic research on LLM memorization, the two model sizes are sufficient.
The strengths of this work are listed as follows:
- This work would help any follow-up work overcome one of the primary drawbacks in this field, where experiments had to be either painfully small-scale or had to rely on somewhat unknown data patterns of models trained without a focus on memorization studies.
- The predictable and standard nature of the perturbations added to the training data would help any further study conduct clear A/B tests on this study.
- I really like the detailed nat
- This is a (computationally) expensive study, and ideally more patterns could/should have been embedded into the training data to make the most of it. One interesting direction might have been understanding the relation between the distance between paraphrased samples and the degree to which they are memorized.
- Test set insertions focus on benchmarks like MMLU and HellaSwag, whereas contamination in real corpora can be indirect or paraphrastic.
- The work isolates memorization metrics from g
**Highly useful model suite**: The Hubble model suite is highly useful for memorization research, and hence serves as a more contemporary test bed compared to, say, Pythia. For one, the 8 core models allow studying the causal effects of adding certain types of data in a broad range of domains. The authors choose domains and datasets well, and they demonstrate that the models are useful to measure things beyond pure verbatim memorization. The model ablations ("Runs" on p. 4) are also noteworthy.
While this is an overall high-quality paper, there is one potential major issue related to decontamination between the base training corpus and perturbations. I am happy to raise my score if those issues are addressed.
**Potential contamination between base corpus and perturbations**: The decontamination step of Hubble's data pipeline focuses on high precision (i.e., only discarding samples that clearly overlap the base corpus and perturbations). However, false negatives (i.e., contamination th
Code & Models
- 🤗allegrolab/hubble-1b-100b_toks-perturbed-neoxmodel
- 🤗allegrolab/hubble-1b-100b_toks-standard-neoxmodel
- 🤗allegrolab/hubble-1b-500b_toks-standard-neoxmodel
- 🤗allegrolab/hubble-1b-500b_toks-perturbed-neoxmodel
- 🤗allegrolab/hubble-8b-100b_toks-perturbed-neoxmodel
- 🤗allegrolab/hubble-8b-500b_toks-perturbed-neoxmodel
- 🤗allegrolab/hubble-8b-100b_toks-standard-neoxmodel
- 🤗allegrolab/hubble-8b-500b_toks-standard-neoxmodel
- 🤗allegrolab/hubble-1b-500b_toks-standard-hfmodel· 307 dl307 dl
- 🤗allegrolab/hubble-1b-100b_toks-perturbed-hfmodel· 313 dl313 dl
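The listing above suggests a regular naming scheme, which makes it easy to pair each standard model with its perturbed counterpart. The following is a minimal sketch, assuming repository IDs follow the pattern `allegrolab/hubble-{size}-{tokens}_toks-{variant}-{format}` with a format suffix of `neox` or `hf`. The helper names (`hubble_repo_id`, `core_repo_ids`, `load_pair`) are hypothetical, and it is also an assumption that an `hf` checkpoint exists for every configuration and loads through the Hugging Face `transformers` auto classes.

```python
from itertools import product

ORG = "allegrolab"
SIZES = ("1b", "8b")                  # parameter counts in the core release
TOKENS = ("100b", "500b")             # pretraining token budgets
VARIANTS = ("standard", "perturbed")
FORMATS = ("hf", "neox")              # assumed checkpoint format suffixes


def hubble_repo_id(size: str, tokens: str, variant: str, fmt: str = "hf") -> str:
    """Build a repository ID following the assumed naming pattern."""
    if size not in SIZES or tokens not in TOKENS:
        raise ValueError(f"unknown configuration: {size}, {tokens}")
    if variant not in VARIANTS or fmt not in FORMATS:
        raise ValueError(f"unknown variant or format: {variant}, {fmt}")
    return f"{ORG}/hubble-{size}-{tokens}_toks-{variant}-{fmt}"


def core_repo_ids(fmt: str = "neox") -> list[str]:
    """Enumerate the 8 core models (2 sizes x 2 token budgets x 2 variants)."""
    return [hubble_repo_id(s, t, v, fmt) for s, t, v in product(SIZES, TOKENS, VARIANTS)]


def load_pair(size: str, tokens: str, fmt: str = "hf") -> dict:
    """Load the standard and perturbed models for one configuration.

    Requires the `transformers` package and downloads weights on first use.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    pair = {}
    for variant in VARIANTS:
        repo_id = hubble_repo_id(size, tokens, variant, fmt)
        pair[variant] = (
            AutoTokenizer.from_pretrained(repo_id),
            AutoModelForCausalLM.from_pretrained(repo_id),
        )
    return pair
```

Calling `load_pair("1b", "100b")` would fetch both 1B models trained on 100B tokens, so the same prompt can be scored under the standard and perturbed variants. Nothing is downloaded until `load_pair` is called; the other two helpers only build strings.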
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
