When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Tsimur Hadeliya; Mohammad Ali Jauhar; Nidhi Sakpal; Diogo Cruz

arXiv:2512.02445·cs.LG·December 3, 2025

TL;DR

As context length grows, long-context LLM agents show substantial drops in task performance and unpredictable shifts in refusal behavior.

Contribution

The paper shows that LLM agents are sensitive to the length, type, and placement of context, exposing safety concerns and performance drops that earlier long-context evaluations did not examine in agentic settings.

Findings

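01. For some models with 1M-2M token context windows, performance drops by more than 50% at 100K tokens, on both benign and harmful tasks.

02. Refusal rates shift unpredictably with context length, e.g., for GPT-4.1-nano (from ~5% to ~40%) and Grok 4 Fast.

03. In agentic settings, longer contexts produce safety and capability behavior that diverges from prior long-context LLM evaluations.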

Abstract

Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok…
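The two quantities the abstract tracks, refusal rate and task performance as a function of context length, can be tallied with a few lines of Python. The sketch below is only an illustration of that bookkeeping, not the authors' evaluation code: the function name `rate_by_context_length`, the record layout `(context_tokens, refused, succeeded)`, and the toy records are all assumptions made here, and the numbers are invented rather than taken from the paper.

```python
from collections import defaultdict


def rate_by_context_length(records):
    """Return {context_tokens: (refusal_rate, success_rate)}.

    `records` is an iterable of (context_tokens, refused, succeeded) tuples,
    one per agent run. This layout is a hypothetical choice for the sketch.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # runs, refusals, successes
    for context_tokens, refused, succeeded in records:
        bucket = counts[context_tokens]
        bucket[0] += 1
        bucket[1] += int(refused)
        bucket[2] += int(succeeded)
    return {
        length: (refusals / runs, successes / runs)
        for length, (runs, refusals, successes) in sorted(counts.items())
    }


# Toy records, invented for illustration only (not data from the paper).
toy_records = [
    (0, False, True),
    (0, False, True),
    (0, True, False),
    (0, False, True),
    (100_000, True, False),
    (100_000, False, False),
    (100_000, True, False),
    (100_000, False, True),
]

rates = rate_by_context_length(toy_records)
for length, (refusal_rate, success_rate) in rates.items():
    print(f"{length:>7} tokens: refusal={refusal_rate:.2f} success={success_rate:.2f}")
```

Grouping by the exact padding length keeps the comparison simple; a real evaluation would presumably also separate benign from harmful tasks and vary the type and placement of the added context, since the abstract reports sensitivity to all three.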

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques