When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

TL;DR
This paper investigates how the task performance and safety behavior of LLM agents change as context length grows, and reports severe degradation and unpredictable refusal behavior at long context lengths.
Contribution
The paper shows that LLM agents are sensitive to the length, type, and placement of context, and documents performance drops and safety concerns in agentic settings, which prior long-context evaluations left relatively unexplored.
Findings
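Models with 1M-2M token context windows show performance drops exceeding 50% already at 100K tokens, on both benign and harmful tasks.
Refusal rates shift unpredictably with context length, e.g., for GPT-4.1-nano (from 5% to 40%) and Grok 4 Fast.
In agentic settings, longer contexts produce capability and safety behavior that diverges from what prior long-context LLM evaluations reported.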
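The two quantities behind these findings, refusal rate per context length and relative performance drop, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The record format, the function names `refusal_rate_by_length` and `relative_drop`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def refusal_rate_by_length(records):
    """Return {context_length: fraction of runs that ended in a refusal}.

    `records` is an iterable of (context_tokens, refused) pairs; this
    format is hypothetical, not taken from the paper.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for context_tokens, refused in records:
        totals[context_tokens] += 1
        refusals[context_tokens] += int(refused)
    return {n: refusals[n] / totals[n] for n in sorted(totals)}


def relative_drop(baseline, score):
    """Fractional drop of `score` relative to a positive `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - score) / baseline


# Hypothetical toy records, NOT results from the paper:
# (context length in tokens, whether the agent refused the task)
toy_records = [
    (0, False), (0, False), (0, False), (0, True),
    (100_000, True), (100_000, True), (100_000, False), (100_000, False),
]

rates = refusal_rate_by_length(toy_records)
print(rates)
print(relative_drop(0.5, 0.25))
```

Under this definition, a "drop exceeding 50%" means `relative_drop(baseline, score) > 0.5`, where the baseline is the score at the shortest context length.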
Abstract
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool-calling capabilities. Prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to the length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from 5% to 40% while Grok…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
