Critical Confabulation: Can LLMs Hallucinate for Social Good?
Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So

TL;DR
This paper explores how large language models can ethically generate plausible historical narratives to fill gaps caused by social and political biases, supporting social good without compromising accuracy.
Contribution
It introduces the concept of critical confabulation, demonstrating how LLMs can generate evidence-bound narratives to address historical omissions and biases.
Findings
LLMs can perform critical confabulation to reconstruct missing historical narratives.
Controlled hallucinations can support knowledge production without losing fidelity.
Validated LLMs' ability to generate plausible, evidence-based historical fill-ins.
Abstract
LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to "fill-in-the-gap" for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history's "hidden figures". We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs' foundational narrative understanding capabilities to perform…
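The narrative cloze setup described in the abstract (hide one event in a character-centric timeline, then ask a model to generate it) can be illustrated with a minimal sketch. This is not the authors' pipeline: the mask token, the prompt wording, the `ClozeItem` type, and the `make_cloze` helper are all hypothetical names chosen for illustration, and the timeline entries are placeholders rather than data from the paper's corpus.

```python
from dataclasses import dataclass

# Hypothetical placeholder string; the paper's actual mask format is not given here.
MASK_TOKEN = "[MASKED EVENT]"


@dataclass
class ClozeItem:
    prompt: str  # timeline text with one event hidden
    target: str  # the held-out event, kept for later comparison


def make_cloze(character: str, events: list[str], mask_index: int) -> ClozeItem:
    """Hide one event in an ordered timeline and build a generation prompt."""
    if not 0 <= mask_index < len(events):
        raise IndexError("mask_index out of range")
    # Number each event and swap the chosen one for the mask token.
    lines = [
        f"{i + 1}. {MASK_TOKEN if i == mask_index else event}"
        for i, event in enumerate(events)
    ]
    prompt = (
        f"Timeline of events for {character}:\n"
        + "\n".join(lines)
        + f"\n\nWrite a plausible event to replace {MASK_TOKEN}, "
        + "consistent with the rest of the timeline."
    )
    return ClozeItem(prompt=prompt, target=events[mask_index])


# Placeholder timeline, invented purely to show the shape of the input.
item = make_cloze("Person X", ["first event", "second event", "third event"], 1)
print(item.prompt)
```

The held-out `target` is kept alongside the prompt so that a generated event could later be compared against the original; how that comparison is scored is left open in this sketch.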
Peer Reviews
Decision · ICLR 2026 Poster
I feel like this paper is so unlike most of the ICLR papers I’ve seen in recent years, especially with its use of historical archival data and its engagement with African American Studies. I also appreciate the authors’ thoughtfulness around experimental decisions.
Though this work is a rare interdisciplinary blend of technical and substantive work, I worry that the way the authors explain their methodology and results may be a barrier for this work actually being useful for “social good”. That is, true “social good” should make research accessible and communicable to the communities associated with the datasets involved. Otherwise, such work is at risk of being extractive. Even as someone with a technical background, I found that there were some parts of
I think this paper raises a very interesting and socially meaningful problem, and I love to see how people connect LLM capabilities to social good in unexpected ways. To do that, the authors have to make sure that the task they choose for hallucinations is not contaminated. Much of the heavy lifting in this paper is done in finding that task and data, which the authors have accomplished with careful justifications regarding data contamination (Section 3.2). I also love
I think the paper would benefit from more qualitative examples and analyses, such as showing the failure modes of LLMs and analyzing the hard data points that most models are unable to get right. In addition, the presentation can be improved. For instance, many figures in the paper have really small fonts and are very hard to read. Figure 1 could also be denser and displayed with a larger font, as I have to zoom in on a laptop to figure out what is being displayed.
The work is very novel and has very good quality in its empirical grounding. For novelty, the paper applies an LLM to a novel domain, namely the study of history. I am not aware of many applications of AI or machine learning to this domain. Additionally, there are also few works that exploit the hallucination/confabulation ability for academic pursuits. Many academic pursuits are based in facts and substantiated claims, so the use of the creative part of an LLM in the academic domain is both cou
The paper could benefit from some clarity enhancement. For example, I have some trouble picturing the differences between the validation cloze tasks. I think the paper would benefit from providing some concrete examples of full timeline $\rightarrow$ partial cloze setup $\rightarrow$ n-gram cloze setup (maybe in an appendix). Additionally, it's not clear why some of the thresholds are what they are. For example, when extracting name candidates, why do you take the top 10,000 persons, and why is le
Taxonomy
Topics: Digital Humanities and Scholarship · Computational and Text Analysis Methods · Topic Modeling
