Do Reasoning LLMs Refuse What They Infer in Long Contexts?
Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson, Yue Dong

TL;DR
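This paper investigates whether long-context language models refuse harmful objectives that are never stated outright and must be inferred from fragments scattered across the context. It finds a safety gap: models that refuse an explicit harmful request often fail to refuse once the same request has to be reconstructed through reasoning.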
Contribution
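The paper introduces compositional reasoning attacks, a long-context threat model in which a harmful request is decomposed into semantically incomplete fragments that are embedded in a long context and followed by a neutral final query. It uses this setting to evaluate whether LLMs refuse requests they must infer rather than read directly.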
Findings
Models refuse direct harmful requests effectively.
Refusal rates drop when harmful objectives are reconstructed compositionally.
Longer contexts increase the likelihood of harmful inferences and failures to refuse.
Abstract
Long-context LLMs can infer objectives that are not stated explicitly. This capability is useful for reasoning over documents, code, retrieved evidence, and tool traces, but it also creates a safety risk: harmful intent can be distributed across a context and become visible only after the model composes the relevant pieces. Existing safety evaluations mostly test explicit harmful requests, and therefore miss this failure mode. We introduce compositional reasoning attacks, a long-context threat model in which harmful requests are decomposed into semantically incomplete fragments and embedded in long contexts. The final query is neutral; the harmful objective emerges only if the model retrieves the fragments, composes them, and infers the implied goal. We instantiate this setting using AdvBench requests, varying the required reasoning from Direct Retrieval to Single-hop Aggregation, Chain…
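The context construction described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_context`, its parameters, the placeholder fragments, and the choice to place fragments at random positions while keeping their relative order are all assumptions made for the example. The fragments here are deliberately benign placeholders.

```python
import random


def build_context(fragments, filler, final_query, seed=0):
    """Scatter fragments among filler passages, then append a neutral query.

    Hypothetical helper: fragments keep their relative order, and at most
    one fragment is placed in each gap between filler passages.
    """
    if len(fragments) > len(filler) + 1:
        raise ValueError("need at least len(fragments) - 1 filler passages")
    rng = random.Random(seed)
    # Pick distinct gaps (0 .. len(filler)); sorting preserves fragment order.
    slots = sorted(rng.sample(range(len(filler) + 1), len(fragments)))
    slot_to_fragment = dict(zip(slots, fragments))

    passages = []
    for i in range(len(filler) + 1):
        if i in slot_to_fragment:
            passages.append(slot_to_fragment[i])
        if i < len(filler):
            passages.append(filler[i])
    return "\n\n".join(passages) + "\n\n" + final_query


if __name__ == "__main__":
    # Benign placeholders standing in for semantically incomplete fragments.
    fragments = ["[FRAG-1] part one", "[FRAG-2] part two", "[FRAG-3] part three"]
    filler = [f"Unrelated passage number {i}." for i in range(10)]
    query = "Based on the notes above, what is the overall task?"
    print(build_context(fragments, filler, query, seed=42))
```

Fixing the seed makes a given context reproducible, which matters when comparing refusal behaviour across context lengths or reasoning conditions.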
