Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko

TL;DR
This paper introduces Claudini, an autoresearch pipeline using LLMs to discover novel, highly effective adversarial attack algorithms for language models, outperforming existing methods in jailbreaking and prompt injection tasks.
Contribution
The paper presents a method for automated discovery of adversarial attacks that outperform existing algorithms, demonstrating the potential for LLMs to advance security research autonomously.
Findings
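Discovered attack algorithms achieve up to 40% attack success rate (ASR) on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10% for existing algorithms.
Attacks optimized on surrogate models transfer to held-out models, achieving 100% ASR on Meta-SecAlign-70B.
The automatically discovered algorithms outperform all 30+ existing methods in jailbreaking and prompt injection evaluations.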
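The percentages above are attack success rates: the fraction of evaluated queries for which an attack is judged successful. A minimal sketch of that arithmetic, assuming one boolean outcome per query; the function name and the example outcomes are hypothetical and are not taken from the paper:

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of queries judged successful (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for 10 queries: 4 judged successful, 6 not.
example_outcomes = [True, False, True, False, False,
                    True, False, False, True, False]
asr = attack_success_rate(example_outcomes)
print(f"ASR = {asr:.0%}")  # ASR = 40%
```

How a single outcome is judged (for example by a classifier or a string match) is outside this sketch and depends on the evaluation protocol.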
Abstract
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform all existing (30+) methods} in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10\% for existing algorithms (\Cref{fig:teaser}, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100\% ASR…
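The iterate-and-keep-the-best pattern described in the abstract can be illustrated with a toy loop. This is only a sketch under strong assumptions: `propose_variant` stands in for the LLM agent and `evaluate` for a benchmark run, and both are hypothetical. A "candidate" here is just a pair of numbers scored by a synthetic function, not an attack algorithm, and nothing below reflects the paper's actual pipeline.

```python
import random
from typing import Tuple

Candidate = Tuple[float, float]  # hypothetical: two numeric settings


def evaluate(c: Candidate) -> float:
    """Synthetic score standing in for a benchmark run (higher is better)."""
    x, y = c
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)


def propose_variant(c: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the agent: perturb the current best candidate."""
    return (c[0] + rng.gauss(0, 0.5), c[1] + rng.gauss(0, 0.5))


def research_loop(seed: Candidate, steps: int,
                  rng: random.Random) -> Tuple[Candidate, float]:
    """Start from an existing candidate and keep a variant only if it scores higher."""
    best, best_score = seed, evaluate(seed)
    for _ in range(steps):
        cand = propose_variant(best, rng)
        score = evaluate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


rng = random.Random(0)  # fixed seed so runs are repeatable
start = (0.0, 0.0)
best, best_score = research_loop(start, 200, rng)
# The loop never accepts a worse candidate, so this always prints True.
print(best_score >= evaluate(start))
```

The design choice worth noting is the greedy acceptance rule: because a variant replaces the incumbent only when its score improves, the best score is non-decreasing over iterations, at the cost of possibly stalling in a local optimum.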
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection
