Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks
Marte Eggen, Eirik Reiestad, Kristian Gjøsteen, Inga Strümke

TL;DR
This paper constructs a backdoor attack for modern, end-to-end trained neural networks that hides the backdoor channel in a learned latent-space direction, and conjectures that distinguishing the backdoored model from a clean one is intractable in practice.
Contribution
It introduces an attack mechanism, closely aligned with the cryptographic notion of undetectability, that applies to state-of-the-art architectures such as ResNet and Vision Transformers.
Findings
Backdoors can be embedded as latent directions in learned representations.
The attack maintains high success rates with minimal impact on model accuracy.
Post-training defenses are ineffective against this latent space backdoor approach.
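To make the idea of a backdoor living in a latent direction concrete, here is a minimal toy sketch. It is not the authors' construction: the random Gaussian "latent vectors", the names `add_trigger` and `backdoor_fires`, and the values of `strength` and `threshold` are all assumptions chosen for illustration. It only shows that shifting a representation along one fixed unit direction can act as a channel that a downstream check responds to, while untouched representations almost never cross the threshold.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def add_trigger(z, direction, strength):
    """Shift latent vector z along the (hypothetical) backdoor direction."""
    return [a + strength * d for a, d in zip(z, direction)]


def backdoor_fires(z, direction, threshold):
    """The channel is 'on' when the projection onto the direction is large."""
    return dot(z, direction) > threshold


rng = random.Random(0)
dim = 64

# Assumed backdoor direction: a random unit vector standing in for a learned one.
direction = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

# Toy stand-ins for latent representations of clean inputs.
clean = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]

strength, threshold = 10.0, 5.0  # illustrative values, not taken from the paper
triggered = [add_trigger(z, direction, strength) for z in clean]

clean_rate = sum(backdoor_fires(z, direction, threshold) for z in clean) / len(clean)
attack_rate = sum(backdoor_fires(z, direction, threshold) for z in triggered) / len(triggered)

print(f"fires on clean latents:     {clean_rate:.2%}")
print(f"fires on triggered latents: {attack_rate:.2%}")
```

In this toy setting the direction is known to whoever runs the check. The undetectability question raised by the paper concerns a defender who sees only the model parameters and must decide whether such a direction was planted or arose naturally.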
Abstract
Recent cryptographic results establish that neural networks can be backdoored such that no efficient algorithm can distinguish them from a clean model. These guarantees, however, have been confined to stylised architectures of limited practical relevance, leaving open whether comparable undetectability extends to modern, end-to-end trained networks. We construct such an attack mechanism for state-of-the-art architectures, closely aligned to the cryptographic notion of undetectability, by identifying backdoor channels as learned latent directions, and show that the question of undetectability reduces to a hypothesis test between two unknown distributions over model parameters, which we conjecture to be intractable in practice. The consequence of this reframing is significant: if exploitable channels within a network's latent space are statistically indistinguishable from naturally…
