TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon; Ruizhi Qian; Minda Zhao; Weiyue Li; Mengyu Wang

arXiv:2602.06440·cs.CL·February 9, 2026

Open Access

TL;DR

This paper introduces TrailBlazer, a history-aware reinforcement learning framework that leverages past vulnerabilities to improve the efficiency and success rate of jailbreaking large language models.

Contribution

It presents a novel history-guided RL approach with an attention mechanism to enhance LLM jailbreak effectiveness and query efficiency.

Findings

01. Achieves state-of-the-art jailbreak success rates.
02. Significantly reduces the number of queries needed.
03. Highlights the importance of historical vulnerability signals.

Abstract

Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI