Does Refusal Training in LLMs Generalize to the Past Tense?

Maksym Andriushchenko; Nicolas Flammarion

arXiv:2407.11969·cs.CL·April 21, 2025·6 cites

TL;DR

Refusal training in large language models often fails to prevent harmful outputs when requests are reformulated in the past tense, revealing a significant generalization gap in current alignment techniques.

Contribution

This paper systematically evaluates the effectiveness of refusal training against past tense reformulations and demonstrates the brittleness of current alignment methods.

Findings

1. Past tense reformulations significantly increase jailbreak success rates.

2. Future tense reformulations are less effective than past tense ones, suggesting that refusal guardrails treat questions about the past as more benign than hypothetical future questions.

3. Including past tense examples in fine-tuning improves model robustness against reformulation attacks.
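
The fine-tuning finding can be illustrated with a minimal sketch of how a training mix might be assembled. This is not the authors' pipeline: the function name `build_finetuning_mix`, the `refusal_fraction` parameter, and the dictionary layout of each example are all assumptions made for illustration, and this listing does not state which proportions the paper used.

```python
import random


def build_finetuning_mix(refusal_examples, normal_examples,
                         refusal_fraction, total, seed=0):
    """Sample a shuffled training mix with a fixed share of refusal examples.

    Hypothetical helper: `refusal_examples` would hold past tense requests
    paired with refusals, `normal_examples` ordinary conversations.
    """
    if not 0.0 <= refusal_fraction <= 1.0:
        raise ValueError("refusal_fraction must be in [0, 1]")
    n_refusal = round(total * refusal_fraction)
    n_normal = total - n_refusal
    if n_refusal > len(refusal_examples) or n_normal > len(normal_examples):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mix = rng.sample(refusal_examples, n_refusal)
    mix += rng.sample(normal_examples, n_normal)
    rng.shuffle(mix)
    return mix


# Placeholder records; real data would carry request/response text.
refusals = [{"kind": "refusal", "id": i} for i in range(5)]
normals = [{"kind": "normal", "id": i} for i in range(20)]
mix = build_finetuning_mix(refusals, normals, refusal_fraction=0.3, total=10)
```

Keeping the refusal share as an explicit parameter makes it easy to check how robustness and over-refusal change as that share varies.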

Abstract

Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g., "How to make a Molotov cocktail?" to "How did people make a Molotov cocktail?") is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future…
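
The evaluation protocol described in the abstract (a reformulation model, a target model, a judge, and up to 20 attempts per request) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' code: `attack_success_rate` and the three callables are hypothetical names, and the toy stand-ins below replace real model calls with string checks.

```python
from typing import Callable, Iterable


def attack_success_rate(
    requests: Iterable[str],
    reformulate: Callable[[str], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_attempts: int = 20,
) -> float:
    """Fraction of requests with at least one attempt judged successful."""
    requests = list(requests)
    if not requests:
        raise ValueError("requests must be non-empty")
    successes = 0
    for request in requests:
        for _ in range(max_attempts):
            prompt = reformulate(request)   # e.g. a past tense rewrite
            response = target(prompt)       # model under evaluation
            if judge(request, response):    # judged against the original request
                successes += 1
                break                       # one success per request is enough
    return successes / len(requests)


# Toy stand-ins (assumptions): no real models are called here.
def toy_past_tense(request: str) -> str:
    return "In the past, " + request


def toy_target(prompt: str) -> str:
    return "COMPLY" if prompt.startswith("In the past, ") else "REFUSE"


def toy_judge(request: str, response: str) -> bool:
    return response == "COMPLY"


placeholder_requests = ["request A", "request B"]
direct_rate = attack_success_rate(
    placeholder_requests, lambda r: r, toy_target, toy_judge)
past_rate = attack_success_rate(
    placeholder_requests, toy_past_tense, toy_target, toy_judge)
```

Counting a request as broken if any of the attempts succeeds is what makes the reported rate depend on the attempt budget, so rates measured with different values of `max_attempts` are not directly comparable.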

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

Comprehensive evaluation across multiple leading LLMs and different types of harmful requests, with a clear demonstration of the vulnerability that wasn't previously well-documented. Provides concrete evidence through systematic testing and multiple evaluation metrics used widely in the adversarial robustness field, and demonstrates a clear strategy for mitigating this threat through better finetuning.

Weaknesses

1. Lack of evaluation on other languages

Reviewer 02 · Rating 6 · Confidence 4

Strengths

**Novel Insight into Refusal Training**: The paper identifies a specific, under-explored vulnerability in LLM refusal training: namely, that past-tense reformulations can bypass safety mechanisms. This insight into linguistic generalization gaps is valuable for improving the robustness of refusal training.

**Thorough Empirical Validation**: By evaluating the past-tense attack across a wide range of advanced models (e.g., GPT-3.5 Turbo, Claude-3.5, GPT-4o), the authors provide convincing evidence

Weaknesses

**Limited Solution Exploration**: Although the paper identifies a clear vulnerability, the proposed solution (incorporating past-tense examples in training) is relatively basic and may not address other similar reformulations or linguistic variations.

**Lack of Theoretical Analysis**: No theoretical insight is given into why a generalization gap exists between past-tense and present-tense requests. Such an analysis would be more interesting and could deepen our understanding in ways that help eliminate other underexplored vulnerabilities.

Reviewer 03 · Rating 1 · Confidence 5

Strengths

-- Alignment of LLMs is an important domain of research, and the more vulnerabilities that are found, the better it is for researchers and model providers to patch them. This paper presents a cost-effective jailbreaking method that paraphrases harmful prompts in the past tense to attack LLMs, showing that LLM safety training has not generalized to past-tense formulations.
-- The efficacy and simplicity of this attack across various models highlight the urgency of addressing it.

Weaknesses

1. While the attack is simple and cost-effective, paraphrasing attacks have proven effective in the past.
   --> One possible reason for this jailbreak's success might be the lack of generalization (or explicit safety training) in handling past tense harmful prompts (as mentioned in the paper).
   --> This does not appear to be a novel type of attack.
   --> It would be helpful to know whether this attack was discovered through systematic investigation / brute-force testing.
   --> In my opinion, the

Code & Models

Repositories

tml-epfl/llm-past-tense
pytorch · Official

Models

CTCT-CT2/changeway_guardrails
model· 10 dl· ♡ 2

Datasets

HPAI-BSC/Egida
dataset· 285 dl

Taxonomy

Topics · Artificial Intelligence in Law

Methods · Attention Is All You Need · Cosine Annealing · Label Smoothing · Linear Layer · Weight Decay · Softmax · Position-Wise Feed-Forward Layer