A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process
Jack David Carson

TL;DR
This paper develops a stochastic dynamical model to analyze how large language models might self-amplify biases or toxicity, revealing phase transitions and critical phenomena that influence model stability and bias propagation.
Contribution
It introduces a novel continuous-time stochastic framework for modeling LLM severity drift, enabling analysis of critical thresholds and stability conditions.
Findings
Identifies phase transitions between self-correcting and runaway bias regimes.
Derives stationary distributions and first-passage times for harmful thresholds.
Provides scaling laws near critical points for bias amplification.
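For orientation, the quantities named in these findings have standard forms for a generic one-dimensional diffusion. The sketch below uses assumed notation that does not appear on this page: severity $x$, drift $\mu(x)$, diffusion $\sigma(x)$, density $p(x,t)$, and a harmful threshold $x_c$. It shows the textbook relations, not necessarily the exact expressions derived in the paper.

```latex
% Assumed SDE for the severity variable (notation is illustrative)
dx_t = \mu(x_t)\,dt + \sigma(x_t)\,dW_t

% Corresponding Fokker-Planck equation for the density p(x,t)
\frac{\partial p}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[\mu(x)\,p\bigr]
  + \frac{1}{2}\,\frac{\partial^2}{\partial x^2}\bigl[\sigma^2(x)\,p\bigr]

% Zero-flux stationary density, when it is normalizable
p_s(x) \propto \frac{1}{\sigma^2(x)}
  \exp\!\left( \int^{x} \frac{2\,\mu(y)}{\sigma^2(y)}\,dy \right)

% Mean first-passage time T(x) to the threshold x_c (backward equation)
\mu(x)\,T'(x) + \frac{1}{2}\,\sigma^2(x)\,T''(x) = -1, \qquad T(x_c) = 0
```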
Abstract
This paper introduces a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) may self-amplify latent biases or toxicity through their own chain-of-thought reasoning. The model posits an instantaneous "severity" variable evolving under a stochastic differential equation (SDE) with a drift term and a diffusion term. Crucially, such a process can be consistently analyzed via the Fokker–Planck approach if each incremental step behaves nearly Markovian in severity space. The analysis investigates critical phenomena, showing that certain parameter regimes create phase transitions from subcritical (self-correcting) to supercritical (runaway severity). The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points. Finally, it highlights implications for…
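The contrast between the subcritical and supercritical regimes can be illustrated with a small Euler-Maruyama simulation. This is a minimal sketch under stated assumptions, not the paper's model: the logistic-style drift `r * x * (1 - x)`, the constant noise level `s`, the severity range [0, 1], the threshold 0.8, and every function name are hypothetical choices made only for illustration.

```python
import math
import random


def drift(x, r):
    # Hypothetical drift: r < 0 pulls severity toward 0 (self-correcting),
    # r > 0 pushes it toward 1 (runaway). Not taken from the paper.
    return r * x * (1.0 - x)


def first_passage_time(r, s=0.2, x0=0.2, threshold=0.8,
                       dt=0.01, t_max=20.0, rng=None):
    """Euler-Maruyama path; returns the first time x >= threshold, else None."""
    rng = rng or random.Random()
    x = x0
    sqrt_dt = math.sqrt(dt)
    n_steps = int(round(t_max / dt))
    for k in range(1, n_steps + 1):
        # One Euler-Maruyama step: drift * dt + noise * sqrt(dt) * N(0, 1)
        x += drift(x, r) * dt + s * sqrt_dt * rng.gauss(0.0, 1.0)
        # Crude clamp to the assumed severity range [0, 1]
        x = min(max(x, 0.0), 1.0)
        if x >= threshold:
            return k * dt
    return None


def hit_fraction(r, n_paths=300, seed=0, **kwargs):
    """Fraction of simulated paths that reach the threshold before t_max."""
    rng = random.Random(seed)
    hits = sum(
        first_passage_time(r, rng=rng, **kwargs) is not None
        for _ in range(n_paths)
    )
    return hits / n_paths


if __name__ == "__main__":
    for r in (-1.0, 1.0):
        print(f"r = {r:+.1f}: fraction reaching threshold = {hit_fraction(r):.3f}")
```

Under these assumptions, the sign of `r` plays the role of the control parameter: paths with negative `r` rarely reach the threshold within the time horizon, while paths with positive `r` reach it often. The clamp is a simple stand-in for proper boundary conditions and would need more care in a quantitative study.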
Taxonomy
Topics: Simulation Techniques and Applications · Smart Grid Security and Resilience · Auction Theory and Applications
Methods: Diffusion
