PAM: Training Policy-Aligned Moderation Filters at Scale

Masoomali Fatehkia; Enes Altinisik; Mohamed Osman; Husrev Taha Sencar

arXiv:2505.19766 · cs.CL · January 8, 2026

TL;DR

PAM introduces a scalable, flexible framework for training custom moderation filters aligned with user policies, outperforming existing safety filters and enabling broader alignment in large language models.

Contribution

The paper presents PAM, a method that automates the training of policy-aligned moderation filters without relying on human-written examples, supporting diverse real-world deployment needs.

Findings

01. PAM-trained filters match the performance of state-of-the-art safety filters.

02. PAM outperforms existing models on new policy enforcement benchmarks.

03. PAM filters run 5-100x faster at inference.

Abstract

Large language models (LLMs) remain vulnerable to misalignment and jailbreaks, making external safeguards like moderation filters essential, yet existing filters often focus narrowly on safety, falling short of the broader alignment needs seen in real-world deployments. We introduce Policy Aligned Moderation (PAM), a flexible framework for training custom moderation filters grounded in user-defined policies that extend beyond conventional safety objectives. PAM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals and generation policies. PAM-trained filters match the performance of state-of-the-art safety moderation filters and policy reasoning models, and outperform them on PAMbench, four newly introduced user-annotated policy enforcement benchmarks that target age restrictions, dietary…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Network Security and Intrusion Detection

Methods: Focus