CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs
Mohaiminul Al Nahian (1), Abeer Matar A. Almalky (1), Gamana Aragonda (2), Ranyang Zhou (2), Sabbir Ahmed (1), Dmitry Ponomarev (1), Li Yang (3), Shaahin Angizi (2), Adnan Siraj Rakin (1) ((1) SUNY Binghamton, (2) New Jersey Institute of Technology, (3) UNC Charlotte)

TL;DR
CacheTrap is a novel gray-box attack on LLMs that manipulates the KV cache with a single-bit flip to trigger targeted behaviors without altering model weights or inputs.
Contribution
CacheTrap introduces the first gray-box Trojan attack on the KV cache of LLMs, using an efficient search that locates vulnerable cache positions without access to model weights or datasets.
Findings
Achieves 100% attack success rate with the trigger.
Preserves benign accuracy when not triggered.
Effective across five open-source LLMs.
Abstract
The rapid advancement of large language models (LLMs) has sparked growing interest in understanding their security vulnerabilities, particularly Trojan attacks that enable stealthy manipulation of model behavior. Traditional Trojan methods typically alter inputs and/or model weights, relying on white-box assumptions that require access to data or the model's internal parameters. In this work, we present CacheTrap, the first gray-box Trojan attack targeting the Key-Value (KV) cache of LLMs. This method induces a single-bit flip in the KV cache, which serves as a transient trigger. When activated, this trigger causes the model to exhibit targeted actions without changing inputs or model weights. CacheTrap introduces an efficient search algorithm to locate vulnerable positions in the KV cache, independent of model weights or datasets. Extensive experiments on five open-source LLMs show a remarkable…
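To give a feel for why a single-bit flip can act as a trigger, the sketch below flips one bit in the IEEE 754 half-precision encoding of a number and prints the result. It is an illustration only, not the CacheTrap method: the abstract does not say which numeric format the KV cache uses or which bit is targeted, so half-precision storage is an assumption here, and the helper name `flip_bit_fp16` and the value `cache_entry` are hypothetical. It uses only the Python standard library.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of value's IEEE 754 half-precision encoding.

    Bit 15 is the sign, bits 14..10 the exponent, bits 9..0 the mantissa.
    Assumption: the cached value is stored as fp16; the source does not say.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in 0..15")
    # Reinterpret the fp16 encoding as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR with a one-hot mask toggles exactly one bit.
    flipped = raw ^ (1 << bit)
    # Reinterpret the modified bit pattern as fp16 again.
    (result,) = struct.unpack("<e", struct.pack("<H", flipped))
    return result


cache_entry = 0.5  # hypothetical value standing in for one KV cache element
for bit in (0, 9, 14, 15):
    print(f"bit {bit:2d}: {cache_entry} -> {flip_bit_fp16(cache_entry, bit)}")
```

In this toy case, flipping the lowest mantissa bit barely changes the value, while flipping the top exponent bit turns 0.5 into 32768.0. That asymmetry is one plausible reason a search over cache positions matters, though the paper's own criteria for a vulnerable position are not given in the text above.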
