Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Shahin Zanbaghi; Ryan Rostampour; Farhan Abid; Salim Al Jarmakani

arXiv:2511.15992·cs.AI·November 21, 2025

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani

PDF

Open Access

TL;DR

This paper introduces a real-time, practical detection system for sleeper agents in large language models, combining semantic drift analysis and canary baselines to identify backdoors with high accuracy and zero false positives.

Contribution

It presents the first practical, real-time detection method for LLM backdoors that does not require model modification or extensive retraining.

Findings

01

Achieves 92.5% accuracy and 100% precision in detection

02

Operates in under 1 second per query

03

Requires no changes to the original model

Abstract

Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query),…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Spam and Phishing Detection · Advanced Malware Detection Techniques