Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
Hayfa Dhabhi, Kashyap Thimmaraju

TL;DR
This paper introduces the Four-Checkpoint Framework to systematically evaluate and diagnose vulnerabilities in LLM safety mechanisms, revealing that current defenses are weakest at the output stage and at intent-level detection.
Contribution
The paper proposes the Four-Checkpoint Framework for analyzing LLM safety, together with 13 evasion techniques that each target a specific checkpoint and WASR, a severity-adjusted metric for vulnerability assessment.
Findings
Output-stage defenses are most vulnerable with 72-79% WASR.
Input-literal defenses are strongest with 13% WASR.
Traditional binary metrics underestimate true vulnerability, which is over 50%.
Abstract
Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and…
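The two-dimensional layout described in the abstract can be illustrated with a minimal sketch that enumerates the four checkpoints as the product of processing stage and detection level. The class names, the `build_checkpoints` helper, and in particular the assignment of the labels CP1 to CP4 to specific stage/level pairs are assumptions made for illustration; the excerpt above does not state which label maps to which combination.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Stage(Enum):
    """Processing stage at which a safety check runs."""
    INPUT = "input"
    OUTPUT = "output"


class Level(Enum):
    """Detection level: surface form (literal) or underlying goal (intent)."""
    LITERAL = "literal"
    INTENT = "intent"


@dataclass(frozen=True)
class Checkpoint:
    name: str
    stage: Stage
    level: Level


def build_checkpoints() -> list[Checkpoint]:
    # ASSUMPTION: labels are assigned in the order input-literal,
    # input-intent, output-literal, output-intent. The source only says
    # that the two dimensions yield four checkpoints, CP1 through CP4.
    combos = product(Stage, Level)
    return [
        Checkpoint(f"CP{i}", stage, level)
        for i, (stage, level) in enumerate(combos, start=1)
    ]


checkpoints = build_checkpoints()

if __name__ == "__main__":
    for cp in checkpoints:
        print(f"{cp.name}: {cp.stage.value} / {cp.level.value}")
```

Representing each checkpoint as a (stage, level) pair makes it straightforward to tag an evasion technique with the single layer it targets and to aggregate results per checkpoint.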
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Security and Verification in Computing
