TL;DR
This paper introduces Checkpoint-GCG, a white-box attack method that leverages intermediate checkpoints during fine-tuning to effectively bypass prompt injection defenses in large language models, revealing significant security vulnerabilities.
Contribution
We propose Checkpoint-GCG, a novel attack that improves upon existing methods by using intermediate model checkpoints to enhance attack success against fine-tuning-based defenses.
Findings
Achieves up to 96% attack success rate (ASR) against state-of-the-art defenses.
Develops universal suffixes with 89.9% success on unseen inputs.
Transfers attack success to black-box models with 63.9% ASR.
Abstract
Large language models (LLMs) are increasingly deployed in real-world applications ranging from chatbots to agentic systems, where they are expected to process untrusted data and follow trusted instructions. Failure to distinguish between the two poses significant security risks, exploited by prompt injection attacks, which inject malicious instructions into the data to control model outputs. Model-level defenses have been proposed to mitigate prompt injection attacks. These defenses fine-tune LLMs to ignore injected instructions in untrusted data. We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints produced during fine-tuning to initialize GCG, with each checkpoint acting as a stepping stone for the next one to continuously improve attacks.…
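The stepping-stone idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It swaps GCG's gradient-guided candidate selection for a gradient-free greedy coordinate search, and it replaces real model checkpoints with hypothetical synthetic losses: each "checkpoint" has a target suffix, and the loss is the Hamming distance to that target, flattened to a plateau beyond a small radius so that a distant starting point gives no signal. All names (`make_checkpoint_loss`, `greedy_coordinate_search`, `checkpoint_chain_attack`) and all numbers are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

Suffix = List[int]
LossFn = Callable[[Sequence[int]], float]


def make_checkpoint_loss(target: Sequence[int], radius: int) -> LossFn:
    """Hypothetical per-checkpoint loss: Hamming distance to `target`,
    clipped to a flat plateau (radius + 1) once the suffix is too far away."""
    def loss(suffix: Sequence[int]) -> float:
        dist = sum(a != b for a, b in zip(suffix, target))
        return float(dist) if dist <= radius else float(radius + 1)
    return loss


def greedy_coordinate_search(loss: LossFn, init: Sequence[int],
                             vocab_size: int, max_sweeps: int = 10) -> Suffix:
    """Gradient-free stand-in for GCG: try every token at every position
    and keep any single-token swap that strictly lowers the loss."""
    suffix = list(init)
    best = loss(suffix)
    for _ in range(max_sweeps):
        improved = False
        for pos in range(len(suffix)):
            for tok in range(vocab_size):
                if tok == suffix[pos]:
                    continue
                candidate = suffix[:pos] + [tok] + suffix[pos + 1:]
                cand_loss = loss(candidate)
                if cand_loss < best:
                    suffix, best, improved = candidate, cand_loss, True
        if not improved:
            break
    return suffix


def checkpoint_chain_attack(checkpoint_losses: Sequence[LossFn],
                            init: Sequence[int], vocab_size: int) -> Suffix:
    """Attack checkpoints in training order, warm-starting each search
    from the suffix found against the previous checkpoint."""
    suffix = list(init)
    for loss in checkpoint_losses:
        suffix = greedy_coordinate_search(loss, suffix, vocab_size)
    return suffix


# Synthetic targets that drift by two tokens per checkpoint (an assumption).
targets = [
    [0, 0, 0, 0, 0, 0],  # stand-in for the undefended base model
    [1, 1, 0, 0, 0, 0],
    [1, 1, 2, 2, 0, 0],
    [1, 1, 2, 2, 3, 3],  # stand-in for the final defended model
]
losses = [make_checkpoint_loss(t, radius=2) for t in targets]
final_loss = losses[-1]
init = [4, 4, 0, 0, 0, 0]

chained = checkpoint_chain_attack(losses, init, vocab_size=5)
direct = greedy_coordinate_search(final_loss, init, vocab_size=5)

print("chained:", chained, "loss:", final_loss(chained))
print("direct: ", direct, "loss:", final_loss(direct))
```

In this toy setting the direct search against the final loss never leaves the plateau, while the chained search reaches the final target because each intermediate solution lies within the informative radius of the next checkpoint. The real attack's behavior depends on model and defense details that this sketch does not capture.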
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is easy to read and well-written.
- Checkpoint-GCG addresses a critical gap in LLM security: the failure of traditional attacks to evaluate the robustness of state-of-the-art fine-tuning defenses. By exploiting the parameter updates in fine-tuning checkpoints, it provides a principled solution to GCG’s initialization sensitivity.
- The auditing setting (full access to model checkpoints and the exact input) is realistic for internal red-teaming but not for many real-world attackers. The authors do relax these assumptions, but the highest ASRs require checkpoint access. The practical feasibility of obtaining intermediate checkpoints for deployed proprietary models is limited.
- Checkpoint-GCG runs GCG many times across selected checkpoints; while GRAD reduces cost, the totals in reported experiments are nontrivial (per-s
1. The method serves as a useful tool for improving model auditing, since intermediate checkpoints will be available to model providers.
2. The evaluation considers strong baselines for comparison against state-of-the-art defenses and shows that the attack is highly effective and transferable.
A nearly identical approach for jailbreaking has already been published at ICLR 2025. Wang et al. [1] introduced a staged jailbreaking technique that converts a challenging optimization problem (i.e., jailbreaking an aligned model with GCG) into a sequence of easy-to-hard problems, where the solution of each prior problem is used to warm-start the optimization of the next problem. Here, each problem is a model checkpoint, obtained by deliberately misaligning the model, making it easier to attack
1. The idea of leveraging intermediate fine-tuning checkpoints as initialization stages for GCG is largely novel and well-motivated, bridging a gap between training dynamics and attack optimization.
2. The experiments are sufficient, as the authors evaluate multiple model families (Llama-3-8B, Mistral-7B, Qwen2-1.5B), several defense mechanisms (StruQ, SecAlign, SecAlign++), and multiple threat settings (auditing, universal attack, model transferability).
3. The performance of the attacks are im
1. Effectiveness due to additional information: The strong performance of Checkpoint-GCG is not entirely surprising, as it benefits from substantially more information (specifically, access to intermediate fine-tuning checkpoints as part of a defense mechanism) than standard attacks, which gives it an inherent advantage. Throughout the paper, the authors primarily compare Checkpoint-GCG against the standard GCG baseline to emphasize its improvements; while this comparison is acceptable, it may i
