Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking
Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

TL;DR
This study analyzes over 2 million real-world conversations to understand the complexity of LLM jailbreak strategies, revealing that attack complexity is bounded and not escalating as previously thought, with implications for AI safety.
Contribution
It provides the first large-scale empirical analysis of jailbreak complexity across diverse platforms, challenging the idea of an escalating arms race in LLM safety.
Findings
Jailbreak attempts do not show higher complexity than normal conversations.
User attack toxicity and complexity remain stable over time.
Assistant response toxicity has decreased, indicating improved safety measures.
Abstract
As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remains stable over time, assistant response…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
