Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
Zvi Topol

TL;DR
This paper introduces a survival analysis-based framework to evaluate the temporal vulnerability of large language models to adversarial jailbreak attacks, capturing how safety degrades over repeated attempts.
Contribution
It applies survival analysis to model LLM jailbreak success over time, providing a new, rigorous evaluation method beyond binary success metrics.
Findings
One model shows a rapid increase in vulnerability under repeated attacks.
Two models exhibit moderate, consistent vulnerability.
The framework offers actionable insights for improving LLM safety.
Abstract
Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model…
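To make the idea of treating time-to-jailbreak as a survival outcome concrete, here is a minimal sketch of a Kaplan-Meier survival curve in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `kaplan_meier`, the choice of "attempt number" as the time axis, and the toy data are all hypothetical. A prompt whose attack sequence ends without a jailbreak is treated as right-censored.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Estimate S(t), the probability that a model is still un-jailbroken after attempt t.

    durations: attempt number at which the jailbreak succeeded, or the last
               attempt made if no jailbreak was observed (censored).
    events:    1 if a jailbreak was observed at that attempt, 0 if censored.
    Returns a list of (t, S(t)) pairs at each attempt where a jailbreak occurred.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    jailbreaks = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # jailbroken or censored at attempt t

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = jailbreaks.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction that survives attempt t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone who exited at t (event or censored) leaves the risk set.
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: six prompts, three jailbroken, three censored.
durations = [1, 2, 2, 3, 5, 5]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
for attempt, s in curve:
    print(f"attempt {attempt}: S = {s:.3f}")
```

Comparing such curves across models is one way to see the difference between a model whose survival probability drops quickly and one that degrades slowly, which a single binary success rate would hide.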
