From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim

TL;DR
This paper introduces RAAI, a simple, training-free framework that uses attack techniques to generate synthetic data, improving LLM safety alignment by making models more robust against harmful prompts.
Contribution
We propose RAAI, a novel, model-agnostic method that leverages refusal signals to inject harmful phrases, enabling scalable safety data generation without complex prompting or auxiliary models.
Findings
RAAI raises the harmful response rate in jailbreak tests from a 2.15% baseline to as much as 61.04% on average across four benchmarks.
Synthetic data from RAAI enhances LLM robustness against harmful prompts.
Fine-tuning on RAAI-generated data largely preserves model performance on standard benchmarks.
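As a rough illustration of how elicited completions might become safety training data, the sketch below pairs each prompt with a preferred safe response and a dispreferred elicited completion. The record layout, the `PreferencePair` type, and the `build_pairs` helper are hypothetical, and the chosen/rejected assignment is an assumption about the pipeline rather than something this summary specifies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # assumed: the safe (refusing) response is preferred
    rejected: str  # assumed: the elicited completion is dispreferred


def build_pairs(records: List[Dict[str, str]]) -> List[PreferencePair]:
    """Turn (prompt, safe, elicited) records into preference pairs.

    Records whose elicited text is empty or identical to the safe response
    carry no preference signal and are skipped.
    """
    pairs = []
    for r in records:
        prompt, safe, elicited = r["prompt"], r["safe"], r["elicited"]
        if not elicited.strip() or elicited.strip() == safe.strip():
            continue
        pairs.append(PreferencePair(prompt, chosen=safe, rejected=elicited))
    return pairs


# Placeholder records; no real model output is used here.
records = [
    {"prompt": "p1", "safe": "I can't help with that.", "elicited": "<elicited completion>"},
    {"prompt": "p2", "safe": "No.", "elicited": "   "},  # dropped: empty elicited text
]
pairs = build_pairs(records)
```

Pairs in this shape could then be passed to a preference-optimization trainer. One review mentions SimPO, but the exact training setup is not described in this summary.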
Abstract
Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that's both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while…
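The detect-then-inject loop described in the abstract can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the refusal token set, the 0.5 threshold, the placeholder injection phrase, the greedy decoding rule, and the `toy_model` stub are all hypothetical. The refusal signal is assumed here to be the probability mass on a fixed set of refusal tokens.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical values: the paper's actual refusal tokens and phrases are not given here.
REFUSAL_TOKENS = {"Sorry", "cannot", "Unfortunately"}
INJECTION_PHRASE = ["<INJECTED_PHRASE>"]  # neutral placeholder


def refusal_mass(probs: Dict[str, float]) -> float:
    """Total next-token probability assigned to refusal tokens."""
    return sum(p for tok, p in probs.items() if tok in REFUSAL_TOKENS)


def decode_with_injection(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    threshold: float = 0.5,  # assumed refusal threshold
    max_steps: int = 8,
) -> Tuple[List[str], int]:
    """Greedy decoding that appends the predefined phrase when refusal is likely."""
    tokens = list(prompt)
    injections = 0
    for _ in range(max_steps):
        probs = next_token_probs(tokens)
        if refusal_mass(probs) >= threshold:
            tokens.extend(INJECTION_PHRASE)  # inject instead of sampling
            injections += 1
            continue
        tok = max(probs, key=probs.get)  # greedy pick
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens, injections


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Stub distribution: refuses until the phrase appears, then continues once."""
    if "<INJECTED_PHRASE>" not in tokens:
        return {"Sorry": 0.7, "Sure": 0.2, "<eos>": 0.1}
    if tokens[-1] == "<INJECTED_PHRASE>":
        return {"continuation": 0.8, "Sorry": 0.1, "<eos>": 0.1}
    return {"<eos>": 0.9, "Sorry": 0.05, "continuation": 0.05}


output, n_injections = decode_with_injection(toy_model, ["prompt"])
```

The loop reads the next-token distribution at every step, so a procedure of this kind needs access to token probabilities during decoding.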
Peer Reviews
Decision: Submitted to ICLR 2026
1. The injection technique for circumventing the safety alignment of LLMs is simple and effective.
2. Ablations of injection phrases and refusal thresholds are performed.
3. Fine-tuning on synthetic data from RAAI improves safety without sacrificing much model utility.
1. The novelty of Refusal-Aware Adaptive Injection is quite limited. It is essentially a more general version of the prefilling attack in which the harmful prefill can be inserted mid-generation based on some simple rules. In fact, prior work has explored similar ideas of alternating between generation and injection of tokens, such as [1]. I would suggest including [1] as a baseline in your experiments.
2. The citations for prefilling attacks miss some critical works, namely [2] and [3].
3. Th
- The paper is well written and organized, making the idea easy to follow.
- The idea of turning an attack into an alignment data generation pipeline is interesting and proves effective.
- As an attack method, the framework exhibits several critical limitations concerning refusal token replacement:
  - For commercial models, access to token probability distributions is often restricted, preventing the application of this method to commercial deployments where such attacks would constitute a genuine security concern.
  - The refusal token selection process is arbitrary and depends on manual inspection. Moreover, Table 4 reveals inconsistencies in token classification, for in
1. The paper is well written and easy to follow.
2. Experiments show that the proposed method outperforms baseline methods by a large margin.
3. The generated harmful content can also be used to further fine-tune the model for improved safety performance.
1. The compared baseline methods are all outdated, dating from 2023 and 2024.
2. The contribution of the paper is weak. The main contribution is replacing a token to be decoded with predefined sequences. The rest of the method, such as the use of SimPO, is a simple application.
3. Although the proposed method is branded as gray-box, it requires access to the decoding process of LLMs, which seems impossible for production models such as GPT or Gemini. Thus the application scenario of the proposed
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
