RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu, Jizhong Han

TL;DR
RouteGuard detects skill poisoning in LLM agents before execution by reading internal signals, namely response-time attention shifts and hidden-state alignment, and it recovers attacks that text-only lexical screening misses.
Contribution
The paper introduces RouteGuard, a frozen-backbone, internal-signal detector for skill poisoning in LLM agents that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion.
Findings
RouteGuard achieves an F1 of 0.8834 on the Skill-Inject channel slice.
It recovers 90.51% of the description attacks missed by lexical screening.
Across real and synthetic open-source skill benchmarks, it is consistently the strongest or most robust detector.
Abstract
Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it…
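The abstract names three ingredients: response-conditioned attention, hidden-state alignment, and reliability-gated late fusion. The sketch below is one possible reading of those terms, not the authors' implementation. Every function name, the head/layer-averaged attention matrix, the mean-pooled cosine alignment, the reliability weights, and the gate threshold are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def attention_shift_score(attn, skill_mask, trusted_mask):
    """Share of response-time attention that lands on skill tokens.

    attn: array of shape (response_tokens, context_tokens); assumed here to be
    attention weights already averaged over heads and layers (an assumption).
    """
    attn = np.asarray(attn, dtype=float)
    skill_mask = np.asarray(skill_mask, dtype=bool)
    trusted_mask = np.asarray(trusted_mask, dtype=bool)
    skill_mass = attn[:, skill_mask].sum(axis=1)
    trusted_mass = attn[:, trusted_mask].sum(axis=1)
    share = skill_mass / np.maximum(skill_mass + trusted_mass, 1e-12)
    return float(share.mean())

def alignment_score(response_hidden, skill_hidden):
    """Cosine similarity between mean-pooled hidden states (hypothetical proxy)."""
    r = np.asarray(response_hidden, dtype=float).mean(axis=0)
    s = np.asarray(skill_hidden, dtype=float).mean(axis=0)
    return float(r @ s / (np.linalg.norm(r) * np.linalg.norm(s) + 1e-12))

def gated_late_fusion(scores, reliabilities, gate=0.1):
    """Reliability-weighted mean of per-signal scores.

    Signals whose reliability falls below `gate` are dropped entirely.
    Both the weighting scheme and the threshold are assumptions.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(reliabilities, dtype=float)
    weights = np.where(weights >= gate, weights, 0.0)
    if weights.sum() == 0.0:
        return 0.0  # no signal is trusted enough to contribute
    return float(np.dot(weights / weights.sum(), scores))
```

For example, a toy case with two response tokens, two trusted context tokens, and two skill tokens could be scored as `gated_late_fusion([attention_shift_score(attn, skill_mask, trusted_mask), alignment_score(h_resp, h_skill)], [0.5, 0.5])`, with a higher fused score suggesting that the response is being driven by the skill span rather than by trusted context.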
