PAM: Training Policy-Aligned Moderation Filters at Scale
Masoomali Fatehkia, Enes Altinisik, Mohamed Osman, Husrev Taha Sencar

TL;DR
PAM is a scalable, flexible framework for training custom moderation filters aligned with user-defined policies. Its filters match state-of-the-art safety filters on safety moderation and outperform them on custom policy enforcement, extending moderation beyond conventional safety objectives.
Contribution
The paper presents PAM, a method that automates the training of policy-aligned moderation filters by generating training data without relying on human-written examples, supporting diverse, application-specific deployment needs.
Findings
PAM-trained filters match state-of-the-art safety moderation filters and policy reasoning models.
PAM-trained filters outperform those models on PAMbench, four newly introduced user-annotated policy enforcement benchmarks.
PAM filters run 5-100x faster at inference.
Abstract
Large language models (LLMs) remain vulnerable to misalignment and jailbreaks, making external safeguards like moderation filters essential, yet existing filters often focus narrowly on safety, falling short of the broader alignment needs seen in real-world deployments. We introduce Policy Aligned Moderation (PAM), a flexible framework for training custom moderation filters grounded in user-defined policies that extend beyond conventional safety objectives. PAM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals and generation policies. PAM-trained filters match the performance of state-of-the-art safety moderation filters and policy reasoning models, and outperform them on PAMbench, four newly introduced user-annotated policy enforcement benchmarks that target age restrictions, dietary…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Network Security and Intrusion Detection
Methods: Focus
