Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks
Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard

TL;DR
This paper introduces RAILS (RAndom Iterative Local Search), a gradient-free attack framework that finds adversarial jailbreaks for large language models using only model logits. It reports high success rates and transferability without relying on handcrafted priors or gradient access.
Contribution
RAILS is presented as the first method to perform effective token-level adversarial attacks without gradients or priors. Removing the gradient requirement enables cross-tokenizer transferability and supports more rigorous robustness evaluation.
Findings
RAILS achieves near-100% attack success rates on open-source models.
It demonstrates high transferability to closed-source systems like GPT.
The method matches the effectiveness of existing gradient-based attack techniques while using only model logits.
Abstract
As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the…
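The abstract describes an iterative, gradient-free search that scores candidates with a loss and keeps a history of them. The general idea of random local search with a candidate history can be sketched as below. This is a minimal illustration and not the authors' implementation: `toy_loss`, `random_local_search`, `history_size`, and the character-level vocabulary are all hypothetical, and the loss is a harmless stand-in (position mismatches against a fixed string) rather than anything computed from model logits. How the paper actually uses its history is not specified in the text above.

```python
import random


def toy_loss(tokens, target):
    """Stand-in objective (assumption): count positions that differ from target."""
    return sum(a != b for a, b in zip(tokens, target))


def random_local_search(loss_fn, vocab, length, steps=2000, history_size=5, seed=0):
    """Gradient-free search: mutate one random position, keep non-worsening moves.

    Returns a list of (loss, tokens) pairs sorted by loss, holding at most
    history_size accepted candidates.
    """
    rng = random.Random(seed)
    current = [rng.choice(vocab) for _ in range(length)]
    current_loss = loss_fn(current)
    history = [(current_loss, list(current))]

    for _ in range(steps):
        candidate = list(current)
        pos = rng.randrange(length)          # pick one position at random
        candidate[pos] = rng.choice(vocab)   # propose a random replacement token
        if candidate == current:
            continue                         # no-op mutation, skip it

        cand_loss = loss_fn(candidate)       # only loss values are used, no gradients
        if cand_loss <= current_loss:
            current, current_loss = candidate, cand_loss
            history.append((cand_loss, list(candidate)))
            history.sort(key=lambda item: item[0])
            del history[history_size:]       # keep only the lowest-loss candidates

    return history


target = list("abca")
history = random_local_search(lambda t: toy_loss(t, target), list("abc"), len(target))
best_loss, best_tokens = history[0]
```

Keeping several low-loss candidates instead of only the single best one leaves room to re-check them later against a different criterion than the loss used during the search.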
