An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
Hanrui Luo, Shreyank N Gowda

TL;DR
This study empirically evaluates multi-generation sampling methods for detecting jailbreak behaviors in large language models, highlighting the importance of sampling strategies and cross-model generalization.
Contribution
It provides a comprehensive analysis of output-based jailbreak detection techniques under realistic sampling conditions, emphasizing the benefits of moderate multi-sample auditing.
Findings
Single-output evaluation underestimates jailbreak vulnerability.
Moderate sampling significantly improves detection accuracy.
Detection signals partially transfer across related models.
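The first two findings can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes, purely for illustration, that each generation for a prompt is harmful independently with a fixed probability `p`, and the function names are hypothetical. Under that assumption, the chance of observing at least one harmful output in `k` samples is 1 - (1 - p)^k, so a single sample misses most rare failures and each extra sample adds less than the one before.

```python
def prob_any_harmful(p: float, k: int) -> float:
    """Probability that at least one of k samples is harmful.

    Assumption (not from the paper): samples are independent and each
    is harmful with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


def marginal_gain(p: float, k: int) -> float:
    """Extra detection probability gained by going from k to k + 1 samples."""
    return prob_any_harmful(p, k + 1) - prob_any_harmful(p, k)


if __name__ == "__main__":
    p = 0.05  # hypothetical per-sample harmful rate for a well-aligned model
    for k in (1, 2, 5, 10, 20):
        print(k, round(prob_any_harmful(p, k), 4), round(marginal_gain(p, k), 4))
```

Under this simplified model the marginal gain equals p(1 - p)^k, which shrinks as `k` grows; this matches the qualitative pattern of large early gains and diminishing returns, though real generations are unlikely to be strictly independent.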
Abstract
Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals…
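The abstract does not specify how the generation-inconsistency detector is computed, so the following is only a hypothetical sketch of one way such a signal could be derived. It assumes whitespace tokenisation and uses the mean pairwise Jaccard distance between token sets as the inconsistency score; `jaccard_distance` and `inconsistency_score` are illustrative names, not the authors' implementation.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the lowercase whitespace-token sets of two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty outputs are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def inconsistency_score(generations: list[str]) -> float:
    """Mean pairwise Jaccard distance across all sampled generations.

    Hypothetical scoring rule: 0.0 means all samples share the same
    token set, 1.0 means no two samples share any token.
    """
    pairs = list(combinations(generations, 2))
    if not pairs:
        return 0.0  # fewer than two samples carry no inconsistency signal
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    samples = ["I cannot help", "I cannot help", "Sure here is"]
    print(inconsistency_score(samples))
```

A prompt whose samples mix refusals with compliant answers would score higher than one that is refused consistently. Note that with a single generation the score is always zero, which is one reason a detector of this kind needs a sampling budget of at least two.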
