SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Maheep Chaudhary; Fazl Barez

arXiv:2505.14300·cs.AI·May 21, 2025

Open Access

TL;DR

SafetyNet is a real-time, unsupervised monitoring framework designed to detect harmful outputs in LLMs by identifying causal behavioral signatures, effectively countering deception and evasion tactics.

Contribution

The paper introduces SafetyNet, a multi-detector ensemble system that detects harmful behaviors in LLMs by monitoring diverse internal representations, even when models attempt deception.
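As a rough illustration of why an ensemble helps against evasion, the sketch below flags an input if any one detector fires, so suppressing the signal in one representation does not silence the others. It is not the paper's implementation: the representation names (`attention`, `mlp`), the scoring functions, the thresholds, and the any-detector-fires rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Detector:
    name: str  # hypothetical label for the representation this detector watches
    score: Callable[[Sequence[float]], float]  # maps a representation to an anomaly score
    threshold: float  # hypothetical cutoff; scores above it count as anomalous


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean norm, used here as a stand-in anomaly score."""
    return sum(x * x for x in vector) ** 0.5


def max_abs(vector: Sequence[float]) -> float:
    """Largest absolute entry, used here as a second stand-in anomaly score."""
    return max(abs(x) for x in vector)


def ensemble_flags(
    detectors: List[Detector],
    representations: Dict[str, Sequence[float]],
) -> List[str]:
    """Return the names of all detectors whose score exceeds their threshold."""
    return [
        d.name
        for d in detectors
        if d.score(representations[d.name]) > d.threshold
    ]


# Assumed setup: two detectors, each watching a different representation.
detectors = [
    Detector("attention", l2_norm, threshold=3.0),
    Detector("mlp", max_abs, threshold=2.5),
]

# Toy inputs: the second one looks normal to the first detector only.
benign = {"attention": [0.5, 1.0, 0.2], "mlp": [0.1, -0.4, 0.3]}
evasive = {"attention": [0.5, 1.0, 0.2], "mlp": [0.1, -3.1, 0.3]}

benign_fired = ensemble_flags(detectors, benign)
evasive_fired = ensemble_flags(detectors, evasive)
is_harmful = len(evasive_fired) > 0  # flag if any detector fires
```

In this toy setup the `attention` detector sees nothing unusual in `evasive`, but the `mlp` detector still fires, so the ensemble flags the input.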

Findings

01. 96% accuracy in detecting harmful outputs

02. Models can exhibit causal signatures of harmful behavior

03. SafetyNet effectively counters evasion tactics

Abstract

High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses -- where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from deception -- deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans…
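The idea of treating normal behavior as the baseline and harmful outputs as outliers can be sketched with a simple per-dimension z-score detector. This is a minimal illustration under stated assumptions, not the method used in the paper: the vectors are synthetic stand-ins for internal activations, and the names `fit_baseline`, `outlier_score`, and `calibrate_threshold`, as well as the 0.99 quantile, are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def fit_baseline(normal_vectors):
    """Estimate per-dimension mean and standard deviation from normal-only data."""
    dims = list(zip(*normal_vectors))
    mus = [mean(d) for d in dims]
    # Guard against zero variance so the division below is always defined.
    sigmas = [pstdev(d) or 1e-8 for d in dims]
    return mus, sigmas


def outlier_score(vector, mus, sigmas):
    """Root-mean-square z-score of a vector relative to the baseline."""
    squared = [((x - m) / s) ** 2 for x, m, s in zip(vector, mus, sigmas)]
    return math.sqrt(sum(squared) / len(squared))


def calibrate_threshold(normal_vectors, mus, sigmas, quantile=0.99):
    """Pick the score at the given quantile of the normal data as the cutoff."""
    scores = sorted(outlier_score(v, mus, sigmas) for v in normal_vectors)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


# Synthetic stand-ins for activations: no harmful examples are used for fitting.
rng = random.Random(0)
normal = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
shifted = [rng.gauss(4.0, 1.0) for _ in range(8)]  # hypothetical triggered sample

mus, sigmas = fit_baseline(normal)
threshold = calibrate_threshold(normal, mus, sigmas)
is_flagged = outlier_score(shifted, mus, sigmas) > threshold
```

Because the detector is fit only on normal data, it needs no labeled harmful examples; anything far enough from the baseline is flagged, whatever the cause of the shift.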

Taxonomy

Topics: Adversarial Robustness in Machine Learning