Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

Oren Rachmil; Avishag Shapira; Roy Betser; Itay Gershon; Omer Hofman; Asaf Shabtai; Yuval Elovici; Roman Vainshtein

arXiv:2512.03994 · cs.LG · January 21, 2026

TL;DR

This paper introduces a training-free method for detecting policy violations in large language models. It applies a whitening transformation to the model's internal activations, enabling efficient and accurate compliance assessment.

Contribution

The authors propose a novel activation-space whitening approach that detects policy violations without training, outperforming existing fine-tuning and LLM-as-a-judge methods in accuracy and efficiency.

Findings

01. Achieves 86.0% F1 score on policy violation detection benchmarks.

02. Outperforms fine-tuned baselines by up to 9.1 points.

03. Outperforms LLM-as-a-judge by 16 points with lower computational cost.

Abstract

As organizations increasingly deploy LLMs in sensitive domains such as legal, financial, and medical settings, ensuring alignment with internal organizational policies has become a priority. Existing content moderation frameworks remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and training cost. To address these limitations, we frame policy violation detection as an out-of-distribution (OOD) problem in the model's activation space. We propose a training-free method that operates directly on the LLM internal representations, leveraging prior evidence that decision-relevant information is encoded within them. Inspired by whitening techniques, we apply a linear transformation to decorrelate and standardize the model's hidden activations,…
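The whitening step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: NumPy is used, the "activations" are synthetic Gaussian vectors standing in for LLM hidden states, the PCA variant of whitening is chosen, and the Euclidean norm in the whitened space is used as a hypothetical out-of-distribution score (the abstract is truncated before the scoring rule is stated). The helper names `fit_whitening` and `whitened_norm` are likewise hypothetical.

```python
import numpy as np

def fit_whitening(acts, eps=1e-5):
    """Estimate the mean and a PCA-whitening matrix from reference activations (n, d)."""
    mu = acts.mean(axis=0)
    centered = acts - mu
    cov = centered.T @ centered / (len(acts) - 1)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue); eps guards small eigenvalues.
    W = eigvecs / np.sqrt(eigvals + eps)
    return mu, W

def whitened_norm(x, mu, W):
    """Hypothetical OOD score: Euclidean norm after centering and whitening."""
    return np.linalg.norm((x - mu) @ W, axis=-1)

d = 8
rng = np.random.default_rng(0)
# Synthetic stand-in for hidden activations: correlated Gaussian samples.
mixing = np.eye(d) + 0.5                       # fixed, well-conditioned mixing matrix
in_policy = rng.normal(size=(2000, d)) @ mixing

mu, W = fit_whitening(in_policy)
whitened = (in_policy - mu) @ W                # decorrelated, unit-variance features

# Samples from the same distribution, shifted to mimic out-of-distribution inputs.
offset = np.zeros(d)
offset[0], offset[1] = 4.0, -4.0
shifted = rng.normal(size=(200, d)) @ mixing + offset

scores_in = whitened_norm(in_policy, mu, W)
scores_out = whitened_norm(shifted, mu, W)
print("mean score (reference):", scores_in.mean())
print("mean score (shifted):  ", scores_out.mean())
```

In this toy setup the whitened reference data has an approximately identity covariance, so the norm behaves like a Mahalanobis distance from the reference distribution, and the shifted samples receive larger scores on average. How the paper selects layers, builds the reference set, and thresholds the score is not covered by the text above.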

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Artificial Intelligence in Healthcare and Education