Fusion-fission forecasts when AI will shift to undesirable behavior
Neil F. Johnson, Frank Yingjie Huo

TL;DR
This paper introduces a mathematical model based on fusion-fission group dynamics to predict when AI behavior might shift from desirable to undesirable, validated across multiple models and datasets.
Contribution
It presents a novel, model-agnostic forecasting method for AI behavior shifts using group dynamics, providing real-time warnings beyond current safety measures.
Findings
Achieved 90% accuracy across seven AI models
Validated predictions across ten chatbots and a large human-AI exchange corpus
Forecasted behavior shifts eleven months in advance
Abstract
The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We…
