TL;DR
SkillSieve is a hierarchical detection framework that efficiently identifies malicious AI agent skills by combining regex, static analysis, and multi-layered LLM evaluations, significantly improving accuracy over previous methods.
Contribution
It introduces a novel three-layer detection system that progressively applies analysis, leveraging LLMs with parallel sub-tasks and voting, to detect security vulnerabilities in AI agent skills.
Findings
Filters 86% of benign skills in under 40ms at zero API cost.
Achieves 0.800 F1 score on a benchmark, outperforming prior work.
Operates effectively on real-world skills and adversarial samples.
Abstract
OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural language instructions in SKILL.md files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
