RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao; Xuehai Tang; Biyu Zhou; Songlin Hu; Jizhong Han

arXiv:2604.22888·cs.CR·April 28, 2026

TL;DR

RouteGuard detects skill poisoning in LLM agents before execution by analyzing internal signals, in particular shifts in response-time attention toward malicious skill spans, and it outperforms text-only filtering methods.

Contribution

The paper introduces RouteGuard, a frozen-backbone internal-signal detector for skill poisoning in LLM agents that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion.

Findings

01. RouteGuard achieves an F1 score of 0.8834 on the Skill-Inject channel slice.

02. It recovers 90.51% of the description attacks missed by lexical screening.

03. Across both real and synthetic open-source skill benchmarks, it is consistently the strongest or most robust detector.

Abstract

Agent skills introduce a new and more severe form of indirect injection for LLM agents: unlike traditional indirect prompt injection, attackers can hide malicious instructions inside a dense, action-oriented skill that already functions as a legitimate instruction source. We study pre-execution skill-poison detection and show that successful skill poisoning induces a structured internal effect, attention hijacking, in which response-time attention shifts from trusted context to malicious skill spans and drives harmful behavior. Motivated by this mechanism, we propose RouteGuard, a frozen-backbone detector that combines response-conditioned attention and hidden-state alignment through reliability-gated late fusion. Across both real and synthetic open-source skill benchmarks, RouteGuard is consistently the strongest or most robust detector; on the critical Skill-Inject channel slice, it…
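
To make the mechanism in the abstract more concrete, here is a minimal Python sketch of two of the ideas it names: measuring how much response-time attention lands on a skill span, and fusing two detector scores with reliability weights. This is not the authors' implementation. The function names, the span-share statistic, and the reliability-weighted average are all assumptions chosen for illustration; the abstract does not specify how RouteGuard computes its features or its gate.

```python
from typing import Sequence


def span_attention_share(attn: Sequence[Sequence[float]], span: range) -> float:
    """Mean fraction of each response token's attention that lands in `span`.

    `attn` is a hypothetical matrix with one row per response token and one
    column per context position (e.g. head-averaged attention weights).
    """
    shares = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            continue  # skip rows with no attention mass
        shares.append(sum(row[i] for i in span) / total)
    return sum(shares) / len(shares) if shares else 0.0


def gated_late_fusion(scores: Sequence[float], reliabilities: Sequence[float]) -> float:
    """Combine per-signal scores using reliability weights (assumed gate form).

    Falls back to a plain average when every reliability is zero.
    """
    weight_sum = sum(reliabilities)
    if weight_sum <= 0:
        return sum(scores) / len(scores)
    return sum(s * r for s, r in zip(scores, reliabilities)) / weight_sum


# Toy example: two response tokens, three context positions,
# with the (hypothetical) skill span at position 2.
toy_attention = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]
attention_score = span_attention_share(toy_attention, range(2, 3))

# A made-up hidden-state alignment score stands in for the second signal.
hidden_state_score = 0.2
fused_score = gated_late_fusion([attention_score, hidden_state_score], [0.9, 0.1])
```

In this toy setup the attention score is about 0.55 and the fused score about 0.515, so the more reliable signal dominates the decision. A real detector would derive both the scores and the reliabilities from the frozen backbone's internals instead of fixed constants.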
