Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Vladimir Vasilenko

TL;DR
This paper reports that an agent's identity document induces attractor-like structure in LLM activation space: paraphrases of the document cluster more tightly than structurally matched controls, the effect appears to be primarily semantic rather than structural, and it replicates across two model families.
Contribution
It provides the first empirical evidence that agent identity documents induce attractor-like geometry in LLM activation space, and attributes the effect mainly to semantic content rather than document structure.
Findings
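Paraphrases of the identity document converge to a tighter cluster than structurally matched controls at layers 8, 16, and 24 (Cohen's d > 1.88), which the authors interpret as attractor-like behavior.
Replication on Gemma 2 9B indicates the effect is not specific to Llama 3.1 8B Instruct.
Reading about the agent shifts internal states toward the attractor region, suggesting that knowledge of the identity influences the model's representations.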
Abstract
Large language models map semantically related prompts to similar internal representations -- a phenomenon interpretable as attractor-like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor-like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean-pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^{-27}, Bonferroni-corrected). Replication on Gemma 2 9B confirms cross-architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An…
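The comparison described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes per-token hidden states have already been extracted as plain lists of floats, uses cosine distance between mean-pooled vectors as the tightness measure (the abstract does not name the metric), and computes Cohen's d with a pooled standard deviation. The function names `mean_pool`, `pairwise_distances`, and `cohens_d`, and the toy vectors, are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def mean_pool(token_states):
    """Average per-token hidden-state vectors into one vector per prompt."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[i] for tok in token_states) / n for i in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; the abstract does not specify one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Cosine distances between all unordered pairs within one condition."""
    return [cosine_distance(u, v) for u, v in combinations(vectors, 2)]


def cohens_d(xs, ys):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var)


# Toy 2-D stand-ins for pooled hidden states (illustrative only, not real data).
paraphrase_vecs = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]   # Condition B analogue
control_vecs = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]     # Condition C analogue

# Positive d means controls are more spread out than paraphrases.
effect = cohens_d(pairwise_distances(control_vecs),
                  pairwise_distances(paraphrase_vecs))
print(f"Cohen's d (controls vs. paraphrases): {effect:.2f}")
```

With real models, the per-token vectors would come from the hidden states at a chosen layer (8, 16, or 24 in the paper), and the same three functions would be applied once per layer.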
