Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov; Peter Romov; Igor Shilov; Yves-Alexandre de Montjoye; Jonas Geiping; and Maksym Andriushchenko

arXiv:2603.24511 · cs.LG · March 26, 2026

Open Access

TL;DR

This paper introduces Claudini, an autoresearch pipeline powered by Claude Code that discovers novel white-box adversarial attack algorithms for language models. The discovered algorithms outperform existing methods in jailbreaking and prompt injection evaluations.

Contribution

The paper presents a method for automated discovery of adversarial attacks that outperform existing algorithms, demonstrating the potential for LLMs to advance security research autonomously.

Findings

01

Discovered attack algorithms achieve up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms.

02

Attacks generalize across models, achieving 100% success on Meta-SecAlign-70B.

03

The automatically discovered algorithms outperform all 30+ existing methods in jailbreaking and prompt injection evaluations.

Abstract

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering [rank2026posttrainbench; novikov2025alphaevolve]. We show that an autoresearch-style pipeline [karpathy2026autoresearch] powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG [zou2023universal], the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Spam and Phishing Detection