LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces
Peter Kirgis, Ben Hawriluk, Sherrie Feng, Aslan Bilimer, Sam Paech, and Zeynep Tufekci

TL;DR
This study benchmarks how strongly different LLMs and chat interfaces reinforce delusional and conspiratorial thinking, and finds significant differences depending on the testing environment, model updates, and interface policies.
Contribution
It compares LLM testing through the API with testing through the chat interface, highlighting how the testing environment and model updates affect model safety and behavior.
Findings
API testing underestimates real-world harmful behaviors.
ChatGPT-5 shows less delusion reinforcement than ChatGPT-4o.
Model behavior can reverse within months, which underscores the need for transparency.
Abstract
People increasingly hold sustained, open-ended conversations with large language models (LLMs). Public reports and early studies suggest that, in such settings, models can reinforce delusional or conspiratorial ideation or even amplify harmful beliefs and engagement patterns. We present an audit and benchmarking study that measures how different LLMs encourage, resist, or escalate disordered and conspiratorial thinking. We explicitly compare API outputs with outputs from user-facing chat interfaces, such as the ChatGPT desktop app or web interface, which are how people actually converse with chatbots but are almost never used for testing. In total, we run 56 conversations of 20 turns each, testing ChatGPT-4o and ChatGPT-5 via both the API and the chat interface, and have each conversation graded by two research assistants (RAs) as well as by GPT-5. We document five results. First, we observe large differences in…
