Adaptive Probe-based Steering for Robust LLM Jailbreaking

Junxi Chen; Junhao Dong; Xiaohua Xie

arXiv:2605.20286 · cs.CR · May 21, 2026

TL;DR

This paper introduces an adaptive probe-based steering method for LLM jailbreaking. It uses model extraction to guide the learned steering vectors toward the ideal one and sets the steering strength adaptively, improving robustness and effectiveness without extra contrastive prompts or manual tuning.

Contribution

It proposes tuning the steering strength adaptively from the statistics of contrastive activations, and uses model extraction to guide the learned steering vectors, improving the robustness and effectiveness of probe-based steering.

Findings

01. Raises the average harmfulness score on fortified LLMs from 6% to 70%.

02. Requires no extra contrastive prompts and no manual tuning of steering strength.

03. Improves the robustness and effectiveness of probe-based steering.

Abstract

Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6% to 70%. Our code is available at…
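The abstract does not say which activation statistics set the steering strength, so the following is only a toy sketch of the general idea of contrastive steering with a data-derived strength. It runs on synthetic vectors rather than model activations. The difference-of-means direction, the "gap between mean projections" rule, and every function name here are illustrative assumptions, not the authors' method or code.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit_direction(group_a, group_b):
    """Difference-of-means direction between two groups, scaled to unit length."""
    diff = [p - q for p, q in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def adaptive_strength(group_a, group_b, direction):
    """Hypothetical rule (an assumption): use the gap between the two groups'
    mean projections onto the direction as the steering strength."""
    proj_a = sum(dot(v, direction) for v in group_a) / len(group_a)
    proj_b = sum(dot(v, direction) for v in group_b) / len(group_b)
    return proj_a - proj_b


def steer(activation, direction, strength):
    """Shift one activation vector along the direction by the given strength."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Synthetic stand-ins for contrastive activations: group A is group B's
# distribution shifted by 2.0 along the first coordinate.
rng = random.Random(0)
dim = 8
offset = [2.0] + [0.0] * (dim - 1)
group_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
group_a = [[rng.gauss(0, 1) + o for o in offset] for _ in range(200)]

direction = unit_direction(group_a, group_b)
strength = adaptive_strength(group_a, group_b, direction)
steered = steer(group_b[0], direction, strength)
```

Because the direction has unit length, steering moves the projection of a vector onto that direction by exactly `strength`, so in this sketch the strength is derived from the data instead of being chosen by hand.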

Code & Models

Repositories

fhdnskfbeuv/adaptiveSteering (GitHub)
