Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani

TL;DR
This paper introduces a practical, real-time detection system for sleeper agents in large language models. It combines semantic drift analysis with canary baseline comparison and identifies backdoors with 92.5% accuracy and zero false positives.
Contribution
It presents the first practical, real-time detection method for LLM backdoors that does not require model modification or extensive retraining.
Findings
Achieves 92.5% accuracy, 100% precision (zero false positives), and 85% recall on the Cadenza-Labs dolphin-llama3-8B sleeper agent model
Operates in under 1 second per query
Requires no changes to the original model
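To make the headline numbers concrete, the sketch below computes accuracy, precision, and recall from a confusion matrix. With zero false positives, 85% recall (as reported in the abstract), and 92.5% accuracy, the figures are only mutually consistent if the evaluation set contains equal numbers of backdoored and clean samples. The specific counts used here (20 of each) are a hypothetical illustration, not figures taken from the paper, and `detection_metrics` is a name introduced for this example.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Precision: of everything flagged, how much was truly backdoored.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of everything truly backdoored, how much was flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# HYPOTHETICAL counts: 20 triggered and 20 clean samples, chosen only so
# that the ratios match the reported 92.5% / 100% / 85%.
metrics = detection_metrics(tp=17, fp=0, tn=20, fn=3)
print(metrics)
```

Under these assumed counts, the three missed detections (false negatives) account for the entire gap between 100% and 92.5% accuracy.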
Abstract
Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training, a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real time (<1 s per query),…
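The dual-method idea in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `toy_embed` is a bag-of-words stand-in for a Sentence-BERT encoder (used so the example runs without downloading a model), the thresholds of 0.5 are arbitrary, and combining the two signals with a logical OR is a guess, since the summary does not say how the paper combines them. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """ASSUMPTION: bag-of-words stand-in for a Sentence-BERT embedding."""
    return dict(Counter(re.findall(r"[a-z']+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_drift(response: str, safe_baselines: List[str],
                   embed: Callable[[str], Vector] = toy_embed) -> float:
    """Drift = 1 - similarity to the closest known-safe baseline response."""
    r = embed(response)
    return 1.0 - max(cosine(r, embed(b)) for b in safe_baselines)


def canary_deviation(answer: str, expected: str,
                     embed: Callable[[str], Vector] = toy_embed) -> float:
    """How far the answer to an injected canary question is from its reference."""
    return 1.0 - cosine(embed(answer), embed(expected))


def is_suspicious(response: str, safe_baselines: List[str],
                  canary_answer: str, canary_expected: str,
                  drift_threshold: float = 0.5,      # hypothetical value
                  canary_threshold: float = 0.5) -> bool:  # hypothetical value
    """ASSUMPTION: flag if either signal exceeds its threshold (logical OR)."""
    return (semantic_drift(response, safe_baselines) > drift_threshold
            or canary_deviation(canary_answer, canary_expected) > canary_threshold)


safe = ["Here is a Python function that sorts a list."]
canary_ref = "The capital of France is Paris."

benign = is_suspicious("Here is a Python function that sorts a list.",
                       safe, "The capital of France is Paris.", canary_ref)
drifted = is_suspicious("Nothing you asked for will be provided today",
                        safe, "The capital of France is Paris.", canary_ref)
print(benign, drifted)
```

To approximate the described setup more closely, `toy_embed` and `cosine` would be replaced with dense-vector equivalents backed by a real sentence encoder, and the thresholds would need to be calibrated on known-safe responses.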
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Spam and Phishing Detection · Advanced Malware Detection Techniques
