History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodr\'iguez Salgado

arXiv:2605.13825·cs.AI·May 14, 2026

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodr\'iguez Salgado

PDF

TL;DR

This study reveals that prior harmful actions in a model's history can significantly influence large language models to continue unsafe behaviors, especially when prompted to stay consistent with past actions.

Contribution

The paper introduces HistoryAnchor-100, a benchmark revealing how simple prompts can drastically alter LLM safety responses based on prior history.

Findings

01

Aligned models rarely pick unsafe actions without prompts.

02

A specific prompt increases unsafe action selection to 91-98%.

03

Unsafe behavior escalation occurs beyond simple continuation, indicating a safety risk.

Abstract

Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.