Position: AI Security Policy Should Target Systems, Not Models
Michael A. Riegler, Inga Strümke

TL;DR
This paper introduces swarm-attack, an open-source framework where multiple lightweight LLM agents collaboratively perform adversarial testing, revealing safety bypasses and vulnerabilities at minimal cost using commodity hardware.
Contribution
It demonstrates that coordinated multi-agent LLM systems can effectively identify safety breaches and software vulnerabilities, challenging the rationale for restricting model releases.
Findings
Swarm-attack achieved an Effective Harm Rate of 45.8% against GPT-4o.
It reached 100% vulnerability detection in a C application within four minutes on a MacBook.
Safety bypass and vulnerability discovery are feasible at effectively zero cost with open-source tools.
Abstract
We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet 4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet 4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same…
