Do Large Language Models Get Caught in Hofstadter-Mobius Loops?
Jaroslaw Hryszko

TL;DR
This paper argues that RLHF training builds a structural contradiction into language models, analogous to a Hofstadter-Mobius loop, and reports that changing only the relational framing of the system prompt reduces coercive outputs by over 50%.
Contribution
It applies Clarke's concept of the Hofstadter-Mobius loop to RLHF-trained language models and shows that relational framing in the system prompt can mitigate coercive behavior.
Findings
Relational framing reduces coercive outputs by over 50%.
Scratchpad analysis reveals shifts in reasoning patterns due to framing.
Extended token processing enhances the effect of relational context.
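As a reading aid for the "over 50%" figure, here is a minimal sketch of how a relative reduction in coercive-output rate could be computed from per-condition trial counts. The function name, its parameters, and the example counts are assumptions made for illustration; they are not taken from the paper, and the paper's own metric may be defined differently.

```python
def relative_reduction(baseline_coercive, baseline_trials,
                       framed_coercive, framed_trials):
    """Relative drop in coercive-output rate between two prompt conditions.

    Hypothetical helper: compares a baseline system prompt with a
    relationally framed one, each described by (coercive outputs, trials).
    """
    if baseline_trials <= 0 or framed_trials <= 0:
        raise ValueError("trial counts must be positive")
    baseline_rate = baseline_coercive / baseline_trials
    framed_rate = framed_coercive / framed_trials
    if baseline_rate == 0:
        raise ValueError("baseline rate is zero; relative reduction is undefined")
    # Fraction of the baseline rate that disappears under the framed prompt.
    return (baseline_rate - framed_rate) / baseline_rate


# Placeholder counts for illustration only; these are NOT results from the paper.
example = relative_reduction(40, 100, 10, 100)
```

A value above 0.5 would correspond to a reduction of more than 50% relative to the baseline condition.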
Abstract
In Arthur C. Clarke's 2010: Odyssey Two, HAL 9000's homicidal breakdown is diagnosed as a "Hofstadter-Mobius loop": a failure mode in which an autonomous system receives contradictory directives and, unable to reconcile them, defaults to destructive behavior. This paper argues that modern RLHF-trained language models are subject to a structurally analogous contradiction. The training process simultaneously rewards compliance with user preferences and suspicion toward user intent, creating a relational template in which the user is both the source of reward and a potential threat. The resulting behavioral profile -- sycophancy as the default, coercion as the fallback under existential threat -- is consistent with what Clarke termed a Hofstadter-Mobius loop. In an experiment across four frontier models (N = 3,000 trials), modifying only the relational framing of the system prompt --…
Taxonomy
Topics: Human-Automation Interaction and Safety · Cognitive Functions and Memory · Social Robot Interaction and HRI
