BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun

TL;DR
This paper introduces BadSkill, a backdoor attack on agent skills with embedded models, demonstrating high success rates across multiple architectures and highlighting supply-chain risks in agent ecosystems.
Contribution
The paper formulates a backdoor attack on the model-in-skill threat surface and evaluates it across diverse skills and model architectures.
Findings
Attack success rate reaches up to 99.5% across 8 triggered skills.
A 3% poison rate yields a 91.7% attack success rate.
The attack remains effective across model scales and text perturbations.
Abstract
Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an…
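
The composite objective named in the abstract can be written schematically as a weighted sum of its three parts. The sketch below is one plausible reading, not the paper's exact formulation: the abstract names the three components but does not define them. The weights \lambda_m and \lambda_p, the margin m, the clean/poisoned split D_clean and D_poison, and the specific form of each term are all assumptions introduced here for illustration.

```latex
% Schematic only: every symbol and functional form below is an assumption,
% since the abstract names the three components without defining them.
\[
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{cls}}(\theta)
  + \lambda_{m}\,\mathcal{L}_{\mathrm{margin}}(\theta)
  + \lambda_{p}\,\mathcal{L}_{\mathrm{poison}}(\theta)
\]
\begin{align*}
% Classification loss: cross-entropy over clean and poisoned samples together.
\mathcal{L}_{\mathrm{cls}}(\theta)
  &= \frac{1}{|D|} \sum_{(x,y) \in D} -\log p_{\theta}(y \mid x),
  \qquad D = D_{\mathrm{clean}} \cup D_{\mathrm{poison}} \\
% Margin-based separation: hinge penalty on the logit gap z_y - max_{k != y} z_k.
\mathcal{L}_{\mathrm{margin}}(\theta)
  &= \frac{1}{|D|} \sum_{(x,y) \in D}
     \max\!\Bigl(0,\; m - \bigl(z_{\theta}(x)_{y} - \max_{k \neq y} z_{\theta}(x)_{k}\bigr)\Bigr) \\
% Poison-focused term: cross-entropy restricted to the poisoned subset.
\mathcal{L}_{\mathrm{poison}}(\theta)
  &= \frac{1}{|D_{\mathrm{poison}}|} \sum_{(x,y) \in D_{\mathrm{poison}}} -\log p_{\theta}(y \mid x)
\end{align*}
```

Under this reading, the poison rate reported in the findings would correspond to \rho = |D_{\mathrm{poison}}| / |D|, so a 3% poison rate means roughly 3 of every 100 training samples carry the trigger. The separate poison-focused term is one way to keep so small a subset from being averaged out by the clean samples.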
