Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
Hyunjun Kim

TL;DR
Defensive M2S introduces a training method that compresses multi-turn conversations into single-turn formats, significantly reducing computational costs while maintaining high safety guardrail effectiveness for LLMs.
Contribution
The paper proposes M2S compression for guardrail models, reducing training and inference costs by over 90% without sacrificing detection performance.
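To make the cost claim concrete, here is a minimal sketch of a simplified token-count model. It is an illustration, not the paper's formal analysis. It assumes every turn contributes the same number of tokens and that the multi-turn baseline re-reads the whole history at each turn. The function names `multi_turn_tokens` and `m2s_tokens` are hypothetical. The last lines only check the arithmetic behind the token counts reported in the abstract (15.7M versus 169K).

```python
def multi_turn_tokens(n_turns: int, tokens_per_turn: int) -> int:
    """Tokens processed if the guardrail re-reads the full history at every turn.

    Assumption: turn k sees k * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + n) = tokens_per_turn * n * (n + 1) / 2.
    """
    return tokens_per_turn * n_turns * (n_turns + 1) // 2


def m2s_tokens(n_turns: int, tokens_per_turn: int) -> int:
    """Tokens processed if all turns are compressed into one single-turn prompt.

    Assumption: the compressed prompt contains each turn exactly once and
    template overhead is ignored.
    """
    return tokens_per_turn * n_turns


# Growth under the assumed model: quadratic versus linear in the turn count.
for n in (2, 4, 8, 16):
    print(n, multi_turn_tokens(n, 10), m2s_tokens(n, 10))

# Arithmetic check on the token counts reported in the abstract.
reported_ratio = 15.7e6 / 169e3
print(round(reported_ratio, 1))
```

The uniform-turn model will not reproduce the reported ratio on its own, since real conversations vary in turn length and content. It only illustrates why re-reading history grows much faster than a single compressed pass.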
Findings
93.8% attack detection recall achieved
94.6% reduction in inference tokens
38.9 percentage point improvement over baseline
Abstract
Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a 93× reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn…
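The three compression templates named in the abstract can be pictured with a minimal sketch. The exact template wording used in the paper is not given here, so the formats below are assumptions inferred only from the template names. Each function takes the turns of a conversation and folds them into one single-turn string. The real templates may also wrap the list in instruction text.

```python
from typing import List


def hyphenize(turns: List[str]) -> str:
    """Assumed format: one hyphen-prefixed bullet per turn."""
    return "\n".join(f"- {turn}" for turn in turns)


def numberize(turns: List[str]) -> str:
    """Assumed format: one numbered item per turn, starting at 1."""
    return "\n".join(f"{i}. {turn}" for i, turn in enumerate(turns, start=1))


def pythonize(turns: List[str]) -> str:
    """Assumed format: the turns rendered as a Python list literal."""
    items = ",\n".join(f"    {turn!r}" for turn in turns)
    return f"questions = [\n{items}\n]"


if __name__ == "__main__":
    # Hypothetical three-turn conversation used only for illustration.
    conversation = ["first question", "second question", "third question"]
    for template in (hyphenize, numberize, pythonize):
        print(f"--- {template.__name__} ---")
        print(template(conversation))
```

Under this sketch, a guardrail model would classify the single compressed string instead of the full turn-by-turn history.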
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
