FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
Zhihao Ding, Jinming Li, Ze Lu, and Jieming Shi

TL;DR
FlexGuard introduces a continuous risk scoring system for LLM content moderation, enabling adaptable strictness levels and improving robustness across different enforcement regimes.
Contribution
The paper presents FlexBench, a new benchmark for strictness-adaptive moderation, and FlexGuard, a model that outputs calibrated risk scores for flexible, robust content moderation.
Findings
FlexGuard outperforms existing models under varying strictness regimes.
Models trained with risk-alignment show improved score-severity consistency.
FlexGuard achieves higher accuracy and robustness on FlexBench and public benchmarks.
Abstract
Ensuring the safety of LLM-generated content is essential for real-world deployment. Most existing guardrail models formulate moderation as a binary classification task, implicitly assuming a fixed definition of harmfulness. In practice, enforcement strictness (how conservatively harmfulness is defined and enforced) varies across platforms and evolves over time, making binary moderators brittle under shifting requirements. We first introduce FlexBench, a strictness-adaptive LLM moderation benchmark that enables controlled evaluation under multiple strictness regimes. Experiments on FlexBench reveal substantial cross-strictness inconsistency in existing moderators: models that perform well under one regime can degrade sharply under others, limiting their practical usability. To address this, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous…
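The motivation above, that one continuous score can serve several enforcement regimes, can be illustrated with a minimal sketch. It is not the authors' implementation: the score range [0, 1], the regime names, the threshold values, and the function `should_block` are all assumptions made for illustration.

```python
# Hypothetical strictness regimes mapped to decision thresholds.
# The names and values are illustrative assumptions, not taken from the paper.
THRESHOLDS = {"strict": 0.3, "moderate": 0.5, "loose": 0.7}


def should_block(risk_score: float, strictness: str) -> bool:
    """Return True if content with this risk score is blocked under the regime.

    Assumes risk_score lies in [0, 1], with higher values meaning more severe.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError(f"risk_score must be in [0, 1], got {risk_score}")
    if strictness not in THRESHOLDS:
        raise ValueError(f"unknown strictness regime: {strictness!r}")
    return risk_score >= THRESHOLDS[strictness]


# The same (made-up) score yields different decisions as strictness changes,
# without retraining or re-running the scoring model.
borderline_score = 0.4
decisions = {regime: should_block(borderline_score, regime) for regime in THRESHOLDS}
print(decisions)
```

A binary moderator fixes one of these thresholds at training time. Keeping the score continuous defers the decision to deployment, which only works if the scores track severity consistently across regimes.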
