Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Sachin Kumar

TL;DR
This paper compares two sparse autoencoder architectures, Diff-SAE and Crosscoders, for detecting backdoors in language models, finding Diff-SAE significantly more effective in isolating backdoor features.
Contribution
The study demonstrates that difference-based autoencoders outperform traditional sparse autoencoders in backdoor detection, revealing that backdoors manifest as activation shifts rather than sparse features.
Findings
Diff-SAE achieves a BIS of 0.40 with perfect precision.
Crosscoders perform poorly with BIS below 0.02.
Backdoors manifest as activation shifts, not sparse features.
Abstract
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
