Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
Yifei Wang, Tianlin Li, Xiaohan Zhang, Yida Yang, Xiaoyu Zhang, Li Pan

TL;DR
This paper uncovers how compilation optimization of large language models can be maliciously exploited to implant stealthy backdoors that bypass standard safety checks, posing new security risks.
Contribution
It introduces a unified attack framework exploiting numerical side effects of compilation to trigger backdoors in LLMs without modifying hardware or compilers.
Findings
The backdoors achieve a 90% attack success rate across multiple LLMs and tasks.
Clean accuracy remains nearly 100% despite backdoors.
Both attacks bypass standard safety evaluations that are run without compilation.
Abstract
Inference optimization is vital for deploying LLMs at scale, and compilation is the most widely adopted optimization technique for LLMs. Although compilation assumes semantic equivalence between the original and compiled graphs, we are the first to show that its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack…
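The "numerical side effects" the abstract refers to can be illustrated with a toy example. The sketch below is not the paper's attack and uses no real compiler. It assumes only that an optimized execution may evaluate a floating-point sum in a different order than the unoptimized one, and shows that such a reordering can be enough to flip a thresholded decision. All names (`score_eager`, `score_reordered`, `predict`, `TERMS`) and the 0.5 threshold are hypothetical.

```python
# Toy illustration only: hypothetical names, not the paper's method.
# Floating-point addition is not associative, so evaluating the same sum
# in a different order can give a different result.

def score_eager(terms):
    """Accumulate left to right, standing in for uncompiled execution."""
    total = 0.0
    for t in terms:
        total += t
    return total

def score_reordered(terms):
    """Add large-magnitude terms first, standing in for a reordered reduction."""
    total = 0.0
    for t in sorted(terms, key=abs, reverse=True):
        total += t
    return total

def predict(score, threshold=0.5):
    """Hypothetical binary decision on the accumulated score."""
    return 1 if score > threshold else 0

# The small term is absorbed when it is added to 1e16 before the cancellation,
# but survives when the two large terms cancel first.
TERMS = [1e16, 1.0, -1e16]

eager = score_eager(TERMS)          # 0.0
reordered = score_reordered(TERMS)  # 1.0

print(eager, predict(eager))
print(reordered, predict(reordered))
```

Both functions compute the same mathematical sum, yet the first yields class 0 and the second class 1. In a real model the discrepancies are far smaller and arise inside fused or reordered kernels. This example only shows why semantic equivalence on paper does not guarantee bit-identical outputs.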
