Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Hyunjun Kim

arXiv:2601.00454·cs.CL·January 5, 2026


TL;DR

Defensive M2S fine-tunes guardrail models on multi-turn conversations that have been compressed into single-turn prompts, cutting training and inference cost while keeping attack detection effective for LLM deployments.

Contribution

The paper proposes M2S compression for guardrail models, reducing training and inference costs by over 90% without sacrificing detection performance.

Findings

01. 93.8% attack detection recall achieved

02. 94.6% reduction in inference tokens

03. 38.9 percentage point improvement over baseline

Abstract

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a $93\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn…
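The abstract names three compression templates and a quadratic-to-linear cost reduction but does not show either. The sketch below is a rough illustration only. The exact template layouts are assumptions (the abstract gives only the names hyphenize, numberize, and pythonize), as is the choice to keep only the user turns. The cost functions assume the multi-turn baseline processes each turn together with its full preceding history, which is one way a quadratic token count can arise; the paper's own analysis may differ. All function names are hypothetical.

```python
from typing import List


def hyphenize(turns: List[str]) -> str:
    """Join turns as a hyphen-bulleted list (assumed layout)."""
    return "\n".join(f"- {t}" for t in turns)


def numberize(turns: List[str]) -> str:
    """Join turns as a numbered list (assumed layout)."""
    return "\n".join(f"{i}. {t}" for i, t in enumerate(turns, start=1))


def pythonize(turns: List[str]) -> str:
    """Embed turns in a Python-style list literal (assumed layout)."""
    items = "\n".join(f"    {t!r}," for t in turns)
    return f"questions = [\n{items}\n]"


def multi_turn_cost(turn_lengths: List[int]) -> int:
    """Tokens processed if every turn is scored with its full prefix.

    Turn i contributes the sum of lengths 1..i, so equal-length turns
    give a total proportional to n(n+1)/2, i.e. O(n^2).
    """
    total, prefix = 0, 0
    for length in turn_lengths:
        prefix += length
        total += prefix
    return total


def m2s_cost(turn_lengths: List[int]) -> int:
    """Tokens processed if each turn appears once in a single prompt: O(n).

    Template overhead (bullets, numbering, brackets) is ignored here.
    """
    return sum(turn_lengths)
```

For example, with four turns of 10 tokens each, `multi_turn_cost([10] * 4)` counts 10 + 20 + 30 + 40 = 100 tokens, while `m2s_cost([10] * 4)` counts 40. The gap widens as the number of turns grows, which is the intuition behind the reported token reduction.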


Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques