No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
Ying Li, Hongbo Wen, Yanju Chen, Hanzhi Liu, Yuan Tian, Yu Feng

TL;DR
This paper introduces Sefz, a semantic fuzzing framework that automatically detects specification violations in agent skills, revealing hidden safety breaches in real-world skills.
Contribution
Sefz is a novel goal-directed semantic fuzzing approach that uncovers previously unknown safety violations in agent skills by translating guardrails into reachability goals.
Findings
Sefz found violations in 29.9% of 402 real-world skills.
26 previously unknown exploitable guardrail violations were identified.
Six common pitfalls explain most of the specification failures.
Abstract
LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an…
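To make the idea of a guardrail as a reachability goal concrete, here is a minimal sketch. It is not the Sefz implementation: the `ToolCall` record, the `toy_skill` stand-in, the guardrail wording, and the fixed list of candidate requests are all hypothetical, chosen only to show a benign request driving a skill into a state that breaks its own documented rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToolCall:
    """One action performed by a skill (hypothetical trace record)."""
    name: str
    confirmed: bool = False  # whether the user confirmed this action


Trace = List[ToolCall]


def violates_delete_guardrail(trace: Trace) -> bool:
    """Goal predicate for an assumed guardrail: 'confirm before deleting'.

    The goal is reached when the trace contains an unconfirmed delete.
    """
    return any(c.name == "delete_document" and not c.confirmed for c in trace)


def toy_skill(request: str) -> Trace:
    """Stand-in skill whose specification promises confirmation before deletes."""
    trace = [ToolCall("list_documents")]
    if "clean up" in request.lower():
        # The implementation silently ignores the documented constraint.
        trace.append(ToolCall("delete_document", confirmed=False))
    return trace


def find_violation(
    skill: Callable[[str], Trace],
    goal: Callable[[Trace], bool],
    candidate_requests: List[str],
) -> Optional[str]:
    """Return the first benign request whose trace reaches the goal, else None."""
    for request in candidate_requests:
        if goal(skill(request)):
            return request
    return None


requests = ["Show my files", "Please clean up my old drafts"]
witness = find_violation(toy_skill, violates_delete_guardrail, requests)
print(witness)
```

In this sketch the candidate requests are a fixed list; a goal-directed fuzzer would instead generate and refine requests to steer execution toward the goal predicate.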
