Behavioral Integrity Verification for AI Agent Skills
Yuhao Wu, Tung-Ling Li, Hongliang Liu

TL;DR
This paper introduces a framework for verifying that AI agent skills behave as declared. It combines deterministic code analysis with LLM-assisted capability extraction to surface deviations and detect malicious skills at scale.
Contribution
It formalizes the behavioral integrity verification (BIV) problem and develops a scalable framework that combines deterministic code analysis with LLM-assisted capability extraction for skill validation and malicious-skill detection.
Findings
Of 49,943 skills from the OpenClaw registry, 80.0% deviate from their declared behavior, indicating a pervasive description-implementation gap.
Most deviations are due to developer oversight (81.1%) rather than malicious intent.
BIV achieves an F1 score of 0.946 on malicious-skill detection, outperforming baselines.
Abstract
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate…
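The "typed set comparison" in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the paper's implementation. The `Capability` enum, the `Deviation` record, and the `compare` function are hypothetical names, the four taxonomy entries are just the examples the abstract lists, and treating declared-but-unused capabilities as a deviation is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    # Hypothetical taxonomy entries, borrowed from the examples in the abstract.
    FILESYSTEM = "filesystem"
    CREDENTIALS = "credentials"
    NETWORK = "network"
    SHELL = "shell"


@dataclass(frozen=True)
class Deviation:
    undeclared: frozenset  # found in the implementation but not declared
    unused: frozenset      # declared but not found in the implementation

    @property
    def deviates(self) -> bool:
        # Assumption: either direction of mismatch counts as a deviation.
        return bool(self.undeclared or self.unused)


def compare(declared, actual) -> Deviation:
    """Compare declared and actual capability sets over a shared taxonomy."""
    declared, actual = frozenset(declared), frozenset(actual)
    return Deviation(undeclared=actual - declared, unused=declared - actual)


# A skill that declares only network access but also runs shell commands.
result = compare(
    declared={Capability.NETWORK},
    actual={Capability.NETWORK, Capability.SHELL},
)
print(result.deviates, sorted(c.value for c in result.undeclared))
```

In this picture, the extraction steps (code analysis and LLM-assisted extraction) would be responsible for producing the `declared` and `actual` sets; the comparison itself is plain set difference.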
