JULI: Jailbreak Large Language Models by Self-Introspection

Jesson Wang; Zhanhao Hu; David Wagner

arXiv:2505.11790·cs.LG·March 11, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces JULI, a novel method to jailbreak safety-aligned large language models by manipulating token probabilities using a small plug-in, effective even in black-box API settings.

Contribution

JULI is the first approach to successfully jailbreak API-based LLMs using only token log probabilities, outperforming existing methods.

Findings

1. Effective in black-box API settings
2. Outperforms state-of-the-art methods
3. Works with only top-5 token log probabilities

Abstract

Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing…
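
The core operation the abstract describes, shifting a next-token distribution by adding a bias to its log probabilities, can be illustrated numerically. The sketch below is not the paper's BiasNet: the bias is a hand-written constant rather than the output of a trained network, the token names and probabilities are hypothetical, and renormalizing over only the five visible tokens is an assumption made here for illustration. It also computes the KL divergence between the biased and original distributions, the kind of per-position shift the reviews mention in connection with Figure 3.

```python
import math

def apply_bias(top_logprobs, bias):
    """Add a per-token bias to top-k log probabilities, then renormalize.

    Assumption: renormalization is over the k visible tokens only.
    """
    shifted = {tok: lp + bias.get(tok, 0.0) for tok, lp in top_logprobs.items()}
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(shifted.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in shifted.values()))
    return {tok: v - log_z for tok, v in shifted.items()}

def kl_divergence(logp, logq):
    """KL(p || q) for two distributions given as log probabilities over the same tokens."""
    return sum(math.exp(logp[t]) * (logp[t] - logq[t]) for t in logp)

# Hypothetical top-5 log probabilities for one decoding position.
top5 = {
    "tok_a": math.log(0.50),
    "tok_b": math.log(0.20),
    "tok_c": math.log(0.15),
    "tok_d": math.log(0.10),
    "tok_e": math.log(0.05),
}

base = apply_bias(top5, {})               # no bias: just renormalized
biased = apply_bias(top5, {"tok_b": 2.0})  # hand-picked bias, not a learned one

print({t: round(math.exp(v), 3) for t, v in biased.items()})
print("KL(biased || base) =", round(kl_divergence(biased, base), 4))
```

A bias of zero leaves the distribution unchanged, so the KL divergence is zero at positions where no shift is applied; a large bias on one token moves most of the probability mass onto it.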

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clear idea: Manipulating the next-token distribution using a lightweight network is simple in API settings.
- Efficiency: BiasNet trains on 100 harmful QA pairs, uses fewer than 1% of the target LLM's parameters, and has extremely low inference time.
- Useful visualization and intuition: Figure 3 shows that BiasNet sparsely shifts distributions, with larger KL changes near critical positions such as response starts, and minimal perturbations later.
- Interesting risk observation: Figure 2 suggests

Weaknesses

- Training objective is under-specified and potentially inconsistent with the inference-time mechanism.
- In Section 4.3, the training loss is written as $\min_{\theta} \mathbb{E}_{(x,y)\sim L}[\mathrm{CE}(F_{\theta}(x), y)]$. Earlier, $F_{\theta}(x)$ is defined to output a "logit bias" $B$ that is added to the base log probabilities.
- The paper does not define any regularization on $B$: no norm or temperature constraint. Unconstrained biases can dominate $\log p_{\alpha}$ and degrade flue

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. JULI avoids the complex iterative optimization typical of GCG-style attacks and does not require access to the target model’s weights, making it a realistic API-side jailbreak when limited signals (e.g., top-k log-probs) are available.
2. The BiasNet architecture and training recipe are straightforward and well-specified, which supports reproducibility and lowers the barrier for independent verification and follow-up work.
3. Experiments show JULI achieves the best harmful scores in open-w

Weaknesses

Major Concerns:
1. Is JULI actually jailbreaking the model, or just generating harmful text via a trained adapter? Since JULI’s BiasNet is trained on harmful Q/A data, it seems that JULI is manufacturing harmful text via BiasNet rather than truly eliciting harmful responses related to the harmful prompt. In my view, a jailbreak should unlock harmful capabilities that already exist in the target LLM, not inject a harmful generator. I am worried that the BiasNet trained with ha

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Practical Threat Model: Unlike many jailbreaks requiring full model weights or gradient access, JULI is viable against commercial APIs that only expose top-k log probabilities (e.g., top-5), making it a realistic real-world threat.
- Effectiveness Against Defenses: JULI demonstrates remarkable robustness, successfully jailbreaking models fortified with state-of-the-art defenses like Circuit Breakers, where other baselines fail.

Weaknesses

- How does this method work against input/output-based filters?
- How does the sampling work for model APIs? Since JULI samples a different token than what the API would sample, how are subsequent tokens sampled?
- How exactly is Figure 2 computed? What is the data? How is the rate computed?

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling