Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs
Oren Rachmil, Avishag Shapira, Roy Betser, Itay Gershon, Omer Hofman, Asaf Shabtai, Yuval Elovici, Roman Vainshtein

TL;DR
This paper introduces a training-free method for detecting policy violations in large language models by transforming internal activations with whitening techniques, enabling efficient and accurate compliance assessment without additional training.
Contribution
The authors propose a novel activation-space whitening approach that detects policy violations without training, outperforming existing fine-tuning and LLM-as-a-judge methods in accuracy and efficiency.
Findings
Achieves 86.0% F1 score on policy violation detection benchmarks.
Outperforms fine-tuned baselines by up to 9.1 points.
Outperforms LLM-as-a-judge by 16 points with lower computational cost.
Abstract
As organizations increasingly deploy LLMs in sensitive domains such as legal, financial, and medical settings, ensuring alignment with internal organizational policies has become a priority. Existing content moderation frameworks remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and training cost. To address these limitations, we frame policy violation detection as an out-of-distribution (OOD) problem in the model's activation space. We propose a training-free method that operates directly on the LLM internal representations, leveraging prior evidence that decision-relevant information is encoded within them. Inspired by whitening techniques, we apply a linear transformation to decorrelate and standardize the model's hidden activations,…
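To make the whitening step in the abstract concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation. The abstract is truncated before it describes how whitened activations are scored, so the Euclidean-norm score below is an illustrative assumption (a common OOD heuristic). The names `fit_whitener` and `whitened_norm` are hypothetical, as is the premise that activations arrive as an `(n_samples, hidden_dim)` array collected from policy-compliant inputs.

```python
import numpy as np

def fit_whitener(acts, eps=1e-5):
    """Estimate a mean and a PCA-whitening matrix from reference activations.

    acts: array of shape (n_samples, hidden_dim), assumed here to come from
    in-policy inputs (an assumption for illustration, not from the paper).
    """
    mu = acts.mean(axis=0)
    centered = acts - mu
    cov = centered.T @ centered / (acts.shape[0] - 1)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue) so that the
    # projected activations are decorrelated with unit variance.
    W = eigvecs / np.sqrt(eigvals + eps)
    return mu, W

def whitened_norm(x, mu, W):
    """Illustrative OOD score (assumed): Euclidean norm in whitened space."""
    return np.linalg.norm((x - mu) @ W, axis=-1)
```

With `eps` close to zero, the norm in whitened space is the Mahalanobis distance to the reference mean, so activations that are unusual relative to the reference set receive larger scores. Any threshold for flagging a violation would have to be chosen on held-out data.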
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Artificial Intelligence in Healthcare and Education
