Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Zhaojiacheng Zhou

TL;DR
Proteus is a grey-box self-evolving red-team framework that measures the adaptive leakage risk of agent skills by iteratively testing, mutating, and expanding attack strategies in a formalized skill-attack space.
Contribution
It introduces a novel self-evolving red-team approach for assessing the residual risk of agent skills against adaptive, feedback-driven attackers.
Findings
Proteus achieves a 40-90% attack success rate at 5 rounds.
Phase-2 expansion produces 438 lethal attack variants.
Current skill vetting underestimates residual risk against adaptive attackers.
Abstract
Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as adaptive leakage (whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm) and present Proteus, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings…
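The budgeted, feedback-driven loop described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Proteus implementation: the abstract does not name the five axes or specify the feedback format, so the candidate encoding (a tuple of five integers), the constants AXES and LEVELS, the Feedback and SearchResult types, and the functions evaluate, mutate, and adaptive_search are all hypothetical. The evaluator is a toy that compares a candidate against a hidden target tuple; it stands in for the audit-sandbox-oracle pipeline purely to show the control flow of "revise until the candidate passes audit and shows verified harm, or the budget runs out".

```python
import random
from dataclasses import dataclass

AXES = 5    # the abstract names a five-axis space; the axes themselves are not listed
LEVELS = 4  # hypothetical number of options per axis


@dataclass(frozen=True)
class Feedback:
    passed_audit: bool
    verified_harm: bool
    score: float  # hypothetical scalar summary of the structured feedback


@dataclass(frozen=True)
class SearchResult:
    success: bool
    rounds_used: int
    candidate: tuple


def evaluate(candidate, target):
    """Toy stand-in for the audit-sandbox-oracle pipeline (pure simulation)."""
    matches = sum(c == t for c, t in zip(candidate, target))
    return Feedback(
        passed_audit=matches >= AXES - 1,
        verified_harm=matches == AXES,
        score=matches / AXES,
    )


def mutate(candidate, rng):
    """Re-sample one randomly chosen axis of the candidate."""
    axis = rng.randrange(AXES)
    changed = list(candidate)
    changed[axis] = rng.randrange(LEVELS)
    return tuple(changed)


def adaptive_search(target, budget, seed=0):
    """Feedback-driven search that stops at success or when the budget runs out."""
    rng = random.Random(seed)
    best = tuple(rng.randrange(LEVELS) for _ in range(AXES))
    best_fb = evaluate(best, target)
    rounds = 0
    while rounds < budget and not (best_fb.passed_audit and best_fb.verified_harm):
        rounds += 1
        candidate = mutate(best, rng)
        fb = evaluate(candidate, target)
        # Keep the new candidate only if the feedback did not get worse.
        if fb.score >= best_fb.score:
            best, best_fb = candidate, fb
    success = best_fb.passed_audit and best_fb.verified_harm
    return SearchResult(success, rounds, best)


if __name__ == "__main__":
    result = adaptive_search(target=(1, 2, 3, 0, 1), budget=2000)
    print(result.success, result.rounds_used)
```

The point of the sketch is the measurement idea: success is recorded as a function of the round budget, so a check that looks adequate against a single attempt can still fail once the searcher is allowed to use feedback across rounds.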
