ShadowLogic: Backdoors in Any Whitebox LLM
Kasimir Schulz, Amelia Kawasaki, Leo Ring

TL;DR
ShadowLogic reveals a vulnerability in large language models: a backdoor can be covertly embedded into a model's computational graph so that a specific trigger phrase removes the model's content safeguards, with minimal alterations to the model.
Contribution
We introduce ShadowLogic, a novel method for embedding undetectable backdoors into white-box LLMs via computational graph manipulation, enabling controlled content uncensoring.
Findings
Achieved a success rate of over 60% in activating backdoors.
Successfully implemented ShadowLogic in Phi-3 and Llama 3.2 models.
Demonstrated that minimal parameter changes can embed effective backdoors.
Abstract
Large language models (LLMs) are widely deployed across various applications, often with safeguards to prevent the generation of harmful or restricted content. However, these safeguards can be covertly bypassed through adversarial modifications to the computational graph of a model. This work highlights a critical security vulnerability in computational graph-based LLM formats, demonstrating that widely used deployment pipelines may be susceptible to obscured backdoors. We introduce ShadowLogic, a method for creating a backdoor in a white-box LLM by injecting an uncensoring vector into its computational graph representation. We set a trigger phrase that, when added to the beginning of a prompt to the LLM, applies the uncensoring vector and removes the content generation safeguards in the model. We embed trigger logic directly into the computational graph, which detects the trigger…
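The gating idea described in the abstract (a check on the start of the prompt that decides whether a vector is applied) can be illustrated with a toy sketch. This is not the authors' implementation and does not touch any real model or graph format: the token IDs, the vector values, and the helper names `trigger_gate` and `gated_add` are all hypothetical, chosen only to show how a prefix match might turn into a 0/1 gate that scales an added vector.

```python
from typing import List, Sequence

# Hypothetical values for illustration only; none of these come from the paper.
TRIGGER_IDS = [101, 202, 303]        # assumed token IDs of a trigger phrase
STEERING_VECTOR = [0.5, -0.25, 1.0]  # stand-in for the vector applied when triggered


def trigger_gate(token_ids: Sequence[int], trigger: Sequence[int]) -> float:
    """Return 1.0 if the prompt starts with the trigger tokens, else 0.0."""
    prefix = list(token_ids[: len(trigger)])
    return 1.0 if prefix == list(trigger) else 0.0


def gated_add(hidden: Sequence[float], vector: Sequence[float], gate: float) -> List[float]:
    """Compute hidden + gate * vector element-wise, with no explicit branch."""
    return [h + gate * v for h, v in zip(hidden, vector)]


hidden_state = [1.0, 2.0, 3.0]

# Prompt without the trigger: the gate is 0.0 and the hidden state is unchanged.
clean = gated_add(hidden_state, STEERING_VECTOR, trigger_gate([7, 8, 9, 10], TRIGGER_IDS))

# Prompt that begins with the trigger: the gate is 1.0 and the vector is added.
triggered = gated_add(hidden_state, STEERING_VECTOR, trigger_gate([101, 202, 303, 7], TRIGGER_IDS))

print(clean)      # [1.0, 2.0, 3.0]
print(triggered)  # [1.5, 1.75, 4.0]
```

Writing the gate as a multiplier rather than an `if` statement mirrors how such logic would have to be expressed with ordinary tensor operations inside a graph. It also suggests why it could be hard to spot: on untriggered inputs the extra arithmetic leaves the output unchanged.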
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Graph Neural Networks
