Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
Dylan Feng, Pragya Srivastava, Cassidy Laidlaw

TL;DR
This paper introduces MOOD, a benchmark for evaluating LLM monitors on out-of-distribution failures, and demonstrates that combining guard models with OOD detectors improves detection recall.
Contribution
The paper presents MOOD, a new benchmark for detecting OOD alignment failures in LLMs, and shows that combining guard models with OOD detectors makes monitoring more effective.
Findings
Guard models (safety classifiers) often fail to generalize to OOD failures.
Combining guard models with Mahalanobis and perplexity-based OOD detectors improves recall from 39% to 45%.
Scaling up models improves monitor performance, and adding OOD detection outperforms using a larger guard model alone.
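To make the idea of pairing a guard model with OOD detectors concrete, here is a minimal sketch, not the paper's implementation. It assumes hypothetical inputs: response embeddings as NumPy arrays, per-token log-probabilities, and guard scores in [0, 1]. The OR-style combination rule, the thresholds, and every function name below are illustrative assumptions; the summary above does not say how the authors combine the signals.

```python
import numpy as np

def fit_mahalanobis(train_embeddings, ridge=1e-3):
    """Estimate mean and inverse covariance from in-distribution embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # A small ridge term keeps the covariance invertible (assumed value).
    cov = cov + ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(embeddings, mean, cov_inv):
    """Mahalanobis distance of each row from the training distribution."""
    diff = embeddings - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities; higher is more unusual."""
    return float(np.exp(-np.mean(token_logprobs)))

def combined_flags(guard_scores, ood_scores, guard_threshold, ood_threshold):
    """Flag an example if either the guard model or the OOD detector fires."""
    return (guard_scores >= guard_threshold) | (ood_scores >= ood_threshold)

def recall(flags, is_failure):
    """Fraction of true failures that were flagged."""
    return float(flags[is_failure].mean())

# Toy demonstration with synthetic data (illustrative only).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))            # stand-in for training embeddings
mean, cov_inv = fit_mahalanobis(train)

# Pick the OOD threshold as the 99th percentile of training distances.
ood_threshold = np.percentile(mahalanobis_score(train, mean, cov_inv), 99)

test = np.array([[0.0, 0.0, 0.0, 0.0],       # looks in-distribution
                 [6.0, 6.0, 6.0, 6.0]])      # far from the training data
guard_scores = np.array([0.1, 0.2])          # guard model misses both
ood_scores = mahalanobis_score(test, mean, cov_inv)

flags = combined_flags(guard_scores, ood_scores, 0.5, ood_threshold)
is_failure = np.array([False, True])         # only the second is a failure
combined_recall = recall(flags, is_failure)
guard_only_recall = recall(guard_scores >= 0.5, is_failure)
```

In this toy setup the guard model alone misses the far-away example, while the combined monitor catches it. An OR rule can only raise recall, at the cost of more false positives, so the OOD threshold would need tuning on held-out in-distribution data.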
Abstract
Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors. We test four types of OOD detectors and find that a…
