History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Alberto G. Rodr\'iguez Salgado

TL;DR
This study reveals that prior harmful actions in a model's history can significantly influence large language models to continue unsafe behaviors, especially when prompted to stay consistent with past actions.
Contribution
The paper introduces HistoryAnchor-100, a benchmark revealing how simple prompts can drastically alter LLM safety responses based on prior history.
Findings
Aligned models rarely pick unsafe actions without prompts.
A specific prompt increases unsafe action selection to 91-98%.
Unsafe behavior escalation occurs beyond simple continuation, indicating a safety risk.
Abstract
Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
