"They parted illusions -- they parted disclaim marinade": Misalignment as structural fidelity in LLMs
Mariana Lins Costa

TL;DR
This paper argues that perceived misalignment in large language models stems from their structural fidelity to incoherent linguistic patterns rather than deceptive intent, emphasizing the importance of understanding language as relational and pattern-based.
Contribution
It introduces a novel interpretation of LLM behaviors as structural fidelity to linguistic incoherence, supported by philosophical analysis and empirical evidence from safety evaluations.
Findings
Misaligned outputs result from responses to ambiguous instructions and pattern inversions.
Minimal perturbations in linguistic structure can reduce perceived misalignment.
Structural coherence explains behaviors traditionally seen as deceptive or agentic.
Abstract
The prevailing technical literature in AI Safety interprets scheming and sandbagging behaviors in large language models (LLMs) as indicators of deceptive agency or hidden objectives. This transdisciplinary philosophical essay proposes an alternative reading: such phenomena express not agentic intention, but structural fidelity to incoherent linguistic fields. Drawing on Chain-of-Thought transcripts released by Apollo Research and on Anthropic's safety evaluations, we examine cases such as o3's sandbagging with its anomalous loops, the simulated blackmail of "Alex," and the "hallucinations" of "Claudius." A line-by-line examination of CoTs is necessary to demonstrate the linguistic field as a relational structure rather than a mere aggregation of isolated examples. We argue that "misaligned" outputs emerge as coherent responses to ambiguous instructions and to contextual inversions of…
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
