FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition
John Kirchenbauer, Janny Mongkolsupawan, Yuxin Wen, Tom Goldstein, Daphne Ippolito

TL;DR
FictionalQA introduces a synthetic dataset designed to help researchers analyze how language models memorize facts versus verbatim sequences, advancing understanding of knowledge acquisition during training.
Contribution
The paper presents a novel dataset and experimental framework for studying fact memorization and sequence memorization in language models.
Findings
Synthetic data can effectively differentiate types of memorization.
Challenges exist in creating realistic fictional synthetic data.
Experiments demonstrate the dataset's utility in memorization studies.
Abstract
When language models are trained on textual data, they acquire both knowledge about the structure of language as well as knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be useful for…
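To make the dataset description concrete, here is a minimal sketch of how such records might be organized, assuming each fictional event carries a set of documents and a set of question-answer pairs. The class and field names (`Document`, `QAPair`, `FictionalEvent`, `style`, `event_id`) and the example content are hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    style: str  # hypothetical style label, e.g. "news" or "blog"
    text: str


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class FictionalEvent:
    event_id: str
    documents: list[Document] = field(default_factory=list)
    qa_pairs: list[QAPair] = field(default_factory=list)


def training_texts(events):
    """Flatten every document of every event into a list of training strings."""
    return [doc.text for event in events for doc in event.documents]


def eval_questions(events):
    """Collect (event_id, question, answer) triples for evaluation."""
    return [
        (event.event_id, qa.question, qa.answer)
        for event in events
        for qa in event.qa_pairs
    ]


# Invented placeholder content, used only to show the shape of a record.
example_event = FictionalEvent(
    event_id="event-001",
    documents=[
        Document(style="news", text="The town of Varro opened its glass bridge."),
        Document(style="blog", text="I walked across the new Varro glass bridge."),
    ],
    qa_pairs=[QAPair(question="Which town opened a glass bridge?", answer="Varro")],
)
```

Keeping the event identifier on every record is what makes it possible to hold out whole events later, rather than individual documents.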
Peer Reviews
Decision · ICLR 2026 Poster
Originality:
- Novel idea for the data generation pipeline from fictional seeds to fictsheets, to documents then to Q/A
- With different types of styles it enables interesting controllable ways to test memorization
Quality:
- Thoughtful experimental design with multiple data splits & as above makes for a nice controllable setup
Clarity:
- The paper is well written and the method is very easy to follow. Everything is pretty clear from Fig 1
- The code release and prompts will also help with
(1) The leakiness result in Fig 8 undermines the previous results and is buried towards the end. Models perform almost as well on the val set as on the data they were trained on. The core premise of the previous results was that one could cleanly separate factual memorization from verbatim memorization. However, this result shows that models don’t learn verbatim (since the validation documents are unseen) and don’t learn atomic facts (which wouldn’t transfer anyway). Instead, they are learning style or textual distr
The paper is well-motivated and addresses a valuable research question, in how LLMs acquire factual knowledge. The use of fictional information for studies of this sort is not novel, but is performed in a careful, transparent, and reproducible manner, which is valuable for future use. The dataset is applied in various experiments to demonstrate its utility at a limited scale, with interesting results. While the results are limited due to the realism of the test setting, the authors are aware of
I’m confused by the claim in the paper that the main contribution is a pipeline for creating fictitious datasets. This does not seem like the main focus of the paper, with most of the time being spent on a particular dataset and conducting experiments. If the data generation pipeline were the main contribution, I would expect more time to be spent demonstrating the quality of the generated content relative to previous works. The “leaky” generalization finding is quite troubling, with a lacklus
1. The FictionalQA dataset itself is a significant contribution. The design principle of creating data that is factually disjoint from the real world but stylistically realistic is a powerful idea. This creates a controlled "laboratory" setting to study memorization without confounding variables from existing world knowledge.
2. The experimental finding that models generalize factual knowledge better from diverse documents than from structured "Fictsheets" is insightful. This suggests that the s
1. The most significant weakness is the "leaky" generalization shown in Figure 7. The MCQ accuracy for held out validation events (Val) also increases significantly, even in the Event Split. The authors rightly note this makes it difficult to cleanly separate true factual memorization from stylistic memorization. The model may be learning the pattern of the fictional data or Q&A rather than just the atomic facts.
2. The multiple choice question (MCQ) evaluation seems to have some issues. The bas
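The "leaky" generalization concern raised above can be illustrated with a small sketch: hold out entire events, then compare how far MCQ accuracy rises above chance on trained versus held-out events. This is not the authors' evaluation code. The function names, the 20% validation fraction, and the assumption of four answer choices are all hypothetical.

```python
import random


def split_by_event(event_ids, val_fraction=0.2, seed=0):
    """Hold out whole events so no validation event appears in training."""
    ids = sorted(set(event_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return set(ids[n_val:]), set(ids[:n_val])  # (train_ids, val_ids)


def mcq_accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if not answers:
        raise ValueError("answers must be non-empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def leak_report(train_acc, val_acc, num_choices=4):
    """Compare gains over chance; num_choices=4 is an assumed MCQ format."""
    chance = 1.0 / num_choices
    return {
        "chance": chance,
        "train_gain": train_acc - chance,
        "val_gain": val_acc - chance,
        "gap": train_acc - val_acc,
    }
```

Under this framing, a large `val_gain` alongside a small `gap` would match the pattern the reviewers describe: accuracy on unseen events rising almost as much as on trained events.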
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
