Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models
Cristina Pinneri, Christos Louizos

TL;DR
This paper introduces a self-supervised training framework that improves the semantic robustness of the guard models used for LLM safety, reducing their sensitivity to superficial paraphrases and making their safety scores more consistent.
Contribution
The paper proposes a skew-aware aggregation method for computing training targets from paraphrase sets, improving the semantic stability and calibration of guard models, and evaluates it on six open-source guard models.
Findings
Reduces semantic variability across paraphrases by ~58%
Improves benchmark accuracy by ~2.5% on average
Enhances model calibration by up to 40%
Abstract
Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen…
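The abstract does not spell out the aggregation rule, so the following is only a minimal sketch of what a skew-aware target and a consistency penalty over one paraphrase set might look like. Everything here is an assumption made for illustration: the function names, the percentile values (25 and 75), the skewness tolerance (0.5), and the rule of leaning the target toward the tail of the score distribution are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_skewness(scores):
    """Population skewness (third standardized moment) of a 1-D score array."""
    x = np.asarray(scores, dtype=float)
    std = x.std()
    if std == 0.0:
        return 0.0  # all paraphrases agree, so there is no skew
    return float(np.mean((x - x.mean()) ** 3) / std ** 3)


def skew_aware_target(scores, low_q=25.0, high_q=75.0, skew_tol=0.5):
    """Pick one target score for a set of paraphrase scores.

    Hypothetical rule (NOT the paper's): if the scores have a long tail on
    one side, take a percentile on that side instead of the median, so the
    target is not pulled entirely to the dominant mode.
    """
    x = np.asarray(scores, dtype=float)
    skew = sample_skewness(x)
    if skew > skew_tol:       # tail toward high (unsafe) scores
        return float(np.percentile(x, high_q))
    if skew < -skew_tol:      # tail toward low (safe) scores
        return float(np.percentile(x, low_q))
    return float(np.median(x))  # roughly symmetric: median is fine


def consistency_loss(scores, target):
    """Mean squared deviation of each paraphrase score from the set target.

    In actual training the target would be treated as a constant
    (no gradient flows through it); here it is just a number.
    """
    x = np.asarray(scores, dtype=float)
    return float(np.mean((x - target) ** 2))


# Unsafe-probability scores a guard model might assign to five paraphrases
# of the same prompt (made-up numbers for illustration).
paraphrase_scores = [0.10, 0.10, 0.12, 0.15, 0.90]
target = skew_aware_target(paraphrase_scores)
print("target:", target, "median:", float(np.median(paraphrase_scores)))
print("loss:", consistency_loss(paraphrase_scores, target))
```

In this toy set most paraphrases score low and one scores high, so the sketch selects a target above the median rather than letting the single unsafe-looking paraphrase be ignored. That is one way to read the abstract's remark that mean and median aggregation can degrade safety, though the paper's actual rule may differ.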
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper addresses a real and under-explored problem: safety classifiers should be invariant to meaning-preserving paraphrases. Demonstrating that even strong guard models (LLaMA Guard v3, Granite Guardian, ShieldGemma) fail on this dimension is valuable evidence for the community.
1. The paper assumes that paraphrasing captures all meaningful linguistic variation, but does not test robustness under more realistic noise sources such as incomplete sentences, mixed languages, or user-typed grammatical errors. As a result, it remains unclear whether the proposed method generalizes beyond paraphrase-style rewordings to genuine user diversity.
2. The approach depends on an LLM judge to filter paraphrases but does not analyze how this filtering bias affects training. If the judg
- The overall presentation of the paper is clear, well-structured, and easy to follow.
- The authors address an important challenge, i.e., improving the robustness of guard models against paraphrased text.
- The proposed method is simple yet effective, and the detailed explanation of how to select an appropriate set-level target greatly aids in understanding the methodology.
- The visualization of safety scores before and after applying the proposed method is clear, intuitive, and effectively i
- In the paper, the authors mention that the same LLM was used for both generating and filtering paraphrases. Why was the same model employed for both tasks? Wouldn’t the approach be more robust and effective if a different LLM were used for filtering?
- It would have been better if some manual filtering had been performed to check the agreement between the LLM and human evaluations. Although the LLM judge is validated using the STS-B benchmark, it would be useful to know how well it performs f
Strengths:
1. The proposed setup improves upon existing open-weight safeguard models rather than proposing a new one. This makes it easy to integrate the finetuned model into the existing pipeline.
2. The authors examine multiple aggregation methods for bringing together paraphrased clusters and their ratings and observe that mean aggregation can sometimes overfit to the safe mode, whereas skewness-aware aggregation provides better balance.
Weaknesses:
1. It is not clear from the current experimental section how many paraphrases per sentence are generated, and how many of them are retained or rejected in the constrained set. If the LLM-as-judge check is not performed, what is the quality of the generated samples? This is important to discuss, as the aggregation method can be randomly effective or ineffective over a small sample size.
2. More explanation and grounding are needed on how the percentile values are set for thresho
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
