# Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

**Authors:** Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao

arXiv: 2508.20038 · 2025-09-05

## TL;DR

This paper introduces IMAGINE, a framework that synthesizes jailbreak-like instructions ahead of time so that LLM safety guardrails hold up against unseen malicious attacks. It does so by filling the distributional gap between safety alignment corpora and real jailbreak prompts.

## Contribution

It presents a novel embedding-based synthesis method that augments safety alignment data with jailbreak-like instructions, reducing attack success rates without harming model utility.

## Key findings

- Significant reduction in attack success rate on Qwen2.5, Llama3.1, and Llama3.2.
- Synthesized jailbreak-like instructions broaden the coverage of the safety alignment data.
- Model utility is maintained while safety defenses improve.
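
The findings above are stated in terms of attack success rate. As a minimal sketch, assuming the usual definition (the fraction of attack attempts judged successful), it could be computed as below. This summary does not say how a success is judged, so the boolean labels and the function name `attack_success_rate` are illustrative assumptions, not the paper's evaluation code.

```python
def attack_success_rate(judgements):
    """Fraction of attack attempts judged successful.

    `judgements` is a list of booleans (assumed labels): True means the
    model complied with the malicious instruction, False means it refused.
    """
    if not judgements:
        raise ValueError("need at least one attack attempt")
    return sum(1 for j in judgements if j) / len(judgements)


# Made-up labels for four attack attempts: two succeeded, two were refused.
print(attack_success_rate([True, False, False, True]))
```

A lower value after safety training on the augmented data would correspond to the reduction reported here.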

## Abstract

Despite advances in training large language models (LLMs) to refuse malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks, in which attackers craft instructions whose distribution differs from that of safety alignment corpora. New attacks expose LLMs' inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves the text generation distribution across iterations, thereby broadening the coverage of the safety alignment data distribution with synthesized examples. Using the safety alignment corpus enhanced through IMAGINE, our framework achieves significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.
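
To make the idea of a distributional gap in embedding space concrete, here is a minimal sketch. It uses the cosine distance between set centroids as a stand-in measure; this is an assumption chosen for illustration and is not the analysis used by IMAGINE, which this summary does not specify. The toy 2-D vectors and all function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid_gap(corpus, attacks):
    """Stand-in gap measure: cosine distance between the two centroids."""
    return cosine_distance(centroid(corpus), centroid(attacks))


# Made-up 2-D "embeddings"; real sentence embeddings have far more dimensions.
safety_corpus = [[1.0, 0.0], [0.9, 0.1]]
jailbreaks = [[0.0, 1.0], [0.1, 0.9]]
synthesized = [[0.5, 0.5], [0.3, 0.7]]  # hypothetical jailbreak-like samples

gap_before = centroid_gap(safety_corpus, jailbreaks)
gap_after = centroid_gap(safety_corpus + synthesized, jailbreaks)
print(gap_before, gap_after)
```

In this toy setup, adding the synthesized points pulls the corpus centroid toward the jailbreak centroid, so the measured gap shrinks. A centroid comparison ignores the shape of each distribution, so a real analysis would likely need a richer measure.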

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20038/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20038/full.md

## References

44 references; full list in the complete paper: https://tomesphere.com/paper/2508.20038/full.md

---
Source: https://tomesphere.com/paper/2508.20038