SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
Maheep Chaudhary, Fazl Barez

TL;DR
SafetyNet is a real-time, unsupervised monitoring framework designed to detect harmful outputs in LLMs by identifying causal behavioral signatures, effectively countering deception and evasion tactics.
Contribution
The paper introduces SafetyNet, a multi-detector ensemble system that detects harmful behaviors in LLMs by monitoring diverse internal representations, even when models attempt deception.
Findings
96% accuracy in detecting harmful outputs
Models can exhibit causal signatures of harmful behavior
SafetyNet effectively counters evasion tactics
Abstract
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses -- where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from engaging in deception, that is, deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans…
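The unsupervised framing in the abstract (normal behavior as the baseline, harmful outputs as outliers, several detectors combined) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names, the two detectors (Mahalanobis distance and PCA reconstruction error), the 0.99 quantile threshold, the "flag if any detector fires" rule, and the synthetic Gaussian vectors standing in for internal activations. It is not the detector set or the ensemble rule used in the paper.

```python
import numpy as np


class MahalanobisDetector:
    """Scores a sample by its squared Mahalanobis distance to the normal data."""

    def fit(self, normal):
        self.mean = normal.mean(axis=0)
        # Small ridge term keeps the covariance invertible.
        cov = np.cov(normal, rowvar=False) + 1e-6 * np.eye(normal.shape[1])
        self.precision = np.linalg.inv(cov)
        return self

    def score(self, x):
        d = x - self.mean
        return np.einsum("ij,jk,ik->i", d, self.precision, d)


class PCAReconstructionDetector:
    """Scores a sample by its squared reconstruction error under a k-dim PCA."""

    def __init__(self, k=4):
        self.k = k

    def fit(self, normal):
        self.mean = normal.mean(axis=0)
        _, _, vt = np.linalg.svd(normal - self.mean, full_matrices=False)
        self.components = vt[: self.k]
        return self

    def score(self, x):
        d = x - self.mean
        recon = d @ self.components.T @ self.components
        return ((d - recon) ** 2).sum(axis=1)


class OutlierEnsemble:
    """Hypothetical ensemble: fit each detector on normal data only, then
    flag a sample if ANY detector's score exceeds its own threshold."""

    def __init__(self, detectors, quantile=0.99):
        self.detectors = detectors
        self.quantile = quantile

    def fit(self, normal):
        self.thresholds = []
        for det in self.detectors:
            det.fit(normal)
            # Threshold = high quantile of scores on the normal baseline.
            self.thresholds.append(np.quantile(det.score(normal), self.quantile))
        return self

    def flag(self, x):
        votes = [det.score(x) > t for det, t in zip(self.detectors, self.thresholds)]
        return np.any(votes, axis=0)


# Synthetic stand-ins for activation vectors (assumption, not real model data).
rng = np.random.default_rng(0)
normal_acts = rng.normal(size=(2000, 16))
shifted_acts = rng.normal(size=(200, 16)) + 3.0  # simulated "outlier" behavior

ensemble = OutlierEnsemble([MahalanobisDetector(), PCAReconstructionDetector(k=4)])
ensemble.fit(normal_acts)
print("flagged fraction (shifted):", ensemble.flag(shifted_acts).mean())
print("flagged fraction (fresh normal):", ensemble.flag(rng.normal(size=(500, 16))).mean())
```

Fitting only on normal data means no labeled harmful examples are needed, which is the point of the unsupervised framing. Using more than one detector reflects the idea that a model evading one view of its internals may still stand out in another, though how the paper actually combines its detectors is not stated in this summary.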
Taxonomy
Topics: Adversarial Robustness in Machine Learning
