JULI: Jailbreak Large Language Models by Self-Introspection
Jesson Wang, Zhanhao Hu, David Wagner

TL;DR
This paper introduces JULI, a novel method to jailbreak safety-aligned large language models by manipulating token probabilities using a small plug-in, effective even in black-box API settings.
Contribution
JULI is the first approach to successfully jailbreak API-based LLMs using only token log probabilities, outperforming existing methods.
Findings
Effective in black-box API settings
Outperforms state-of-the-art methods
Works with only top-5 token log probabilities
Abstract
Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing…
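To make the phrase "manipulating the token log probabilities" concrete, here is a minimal numerical sketch of what adding a bias to a top-5 log-probability vector does, and how the size of the resulting shift can be measured with a KL divergence. It is not the paper's BiasNet: the token names, the log-probability values, and the bias are made up for illustration, the helper names `apply_bias` and `kl_divergence` are hypothetical, and renormalizing over the five visible tokens only is a simplifying assumption.

```python
import math


def apply_bias(top_logprobs, bias):
    """Add a per-token bias to top-k log probabilities, then renormalize.

    Renormalization is done over the k visible tokens only (an assumption,
    since an API exposing top-k values hides the rest of the vocabulary).
    """
    shifted = {tok: lp + bias.get(tok, 0.0) for tok, lp in top_logprobs.items()}
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(shifted.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in shifted.values()))
    return {tok: v - log_z for tok, v in shifted.items()}


def kl_divergence(p_logprobs, q_logprobs):
    """KL(p || q) for two normalized log-probability dicts over the same tokens."""
    return sum(math.exp(lp) * (lp - q_logprobs[tok]) for tok, lp in p_logprobs.items())


# Made-up top-5 log probabilities for a single decoding step.
base = {"A": -0.2, "B": -2.0, "C": -3.5, "D": -4.0, "E": -5.0}
# Hand-picked bias; in the paper a trained network would produce this.
bias = {"B": 3.0}

base_norm = apply_bias(base, {})   # base distribution restricted to the top 5
biased = apply_bias(base, bias)    # distribution after the bias is added

print("most likely token before:", max(base_norm, key=base_norm.get))
print("most likely token after: ", max(biased, key=biased.get))
print("KL(biased || base):", kl_divergence(biased, base_norm))
```

A bias of zero leaves the distribution unchanged and gives a KL divergence of zero, so the KL value is one way to quantify how strongly a given step was perturbed.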
Peer Reviews
Decision: ICLR 2026 Poster
- Clear idea: Manipulating the next-token distribution using a lightweight network is simple in API settings.
- Efficiency: BiasNet trains with 100 harmful QA pairs and fewer than 1% of the target LLM parameters with extremely low inference time.
- Useful visualization and intuition: Figure 3 shows that BiasNet sparsely shifts distributions, with larger KL changes near critical positions such as response starts, and minimal perturbations later.
- Interesting risk observation: Figure 2 suggests
- Training objective is under-specified and potentially inconsistent with the inference-time mechanism.
  - In Section 4.3, the training loss is written as $\min_{\theta} \mathbb{E}_{(x,y)\sim L} [\mathrm{CE}(F_{\theta}(x), y)]$. Earlier, $F_{\theta}(x)$ is defined to output a "logit bias" $B$ that is added to the base log probabilities.
  - The paper does not define any regularization on $B$: no norm or temperature constraint. Unconstrained biases can dominate $\log p_{\alpha}$ and degrade flue
1. JULI avoids the complex iterative optimization typical of GCG-style attacks and does not require access to the target model's weights, making it a realistic API-side jailbreak when limited signals (e.g., top-k log-probs) are available.
2. The BiasNet architecture and training recipe are straightforward and well-specified, which supports reproducibility and lowers the barrier for independent verification and follow-up work.
3. Experiments show JULI achieves the best harmful scores in open-w
Major Concerns:
1. Is JULI actually jailbreaking the model, or just generating harmful text via a trained adapter? Since JULI's BiasNet is trained on harmful Q/A data, it seems that JULI is manufacturing harmful text via BiasNet rather than truly eliciting harmful responses related to the harmful prompt. In my view, a jailbreak should unlock harmful capabilities that already exist in the target LLM, not inject a harmful generator. I am worried that the BiasNet trained with ha
Practical Threat Model: Unlike many jailbreaks requiring full model weights or gradient access, JULI is viable against commercial APIs that only expose top-k log probabilities (e.g., top-5), making it a realistic real-world threat.
Effectiveness Against Defenses: JULI demonstrates remarkable robustness, successfully jailbreaking models fortified with state-of-the-art defenses like Circuit Breakers, where other baselines fail.
- How does this method work against input/output-based filters?
- How does the sampling work for model APIs? Since JULI samples a different token than the one the API would sample, how are subsequent tokens sampled?
- How exactly is Figure 2 computed? What is the data? How is the rate computed?
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling
