Verifying LLM Inference to Detect Model Weight Exfiltration
Roy Rinberg, Adam Karvonen, Alexander Hoover, Daniel Reuter, Keri Warr

TL;DR
This paper presents a verification framework to detect and mitigate model weight exfiltration via steganography during large language model inference, significantly reducing information leakage with minimal performance impact.
Contribution
It formalizes model exfiltration as a security game, proposes a provably effective verification scheme, and introduces practical estimators for non-determinism in LLM inference.
Findings
The detector reduces exfiltratable information to under 0.5% on 30B-parameter models
The false-positive rate is below 0.01%
Adversaries face a slowdown of more than 200x when exfiltrating weights
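The leakage and slowdown figures above are two views of the same quantity: if only a fraction of the original covert bandwidth survives verification, moving a fixed payload takes the reciprocal of that fraction longer. The short sketch below works through that arithmetic. The function names are hypothetical, and the unit over which the false-positive rate is measured (per token, per response, or otherwise) is not stated in this summary, so `n_checks` is left generic.

```python
def slowdown_factor(leak_fraction: float) -> float:
    """Factor by which exfiltration time grows when only `leak_fraction`
    of the attacker's original covert bandwidth remains."""
    if not 0.0 < leak_fraction <= 1.0:
        raise ValueError("leak_fraction must be in (0, 1]")
    return 1.0 / leak_fraction


def expected_false_alarms(n_checks: int, false_positive_rate: float) -> float:
    """Expected number of honest checks flagged by the detector.
    The unit of a 'check' is an assumption, not taken from the paper."""
    return n_checks * false_positive_rate


# Leakage capped at 0.5% of the original corresponds to roughly a 200x slowdown.
print(slowdown_factor(0.005))

# A 0.01% false-positive rate flags about 100 of every million honest checks.
print(expected_false_alarms(1_000_000, 0.0001))
```

Because the reported leakage is an upper bound (under 0.5%), the slowdown is a lower bound (over 200x).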
Abstract
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model responses, a strategy known as steganography. This work investigates how to verify LLM model inference to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model weight exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from…
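To make the idea of verifying inference under benign non-determinism more concrete, here is a minimal toy sketch. It is not the paper's scheme or either of its estimators, which this summary does not describe. It assumes that the server and a trusted verifier share a sampling seed, that tokens are drawn with the Gumbel-max trick, and that numerical noise between runs is bounded by a small `tolerance` on the perturbed scores. All function names and parameters are hypothetical.

```python
import math
import random


def gumbel_noise(seed: int, position: int, vocab_size: int) -> list[float]:
    """Deterministic Gumbel noise for one token position, derived from a shared seed."""
    rng = random.Random(seed * 1_000_003 + position)
    noise = []
    for _ in range(vocab_size):
        u = max(rng.random(), 1e-12)  # clamp away from 0 so log() is defined
        noise.append(-math.log(-math.log(u)))
    return noise


def sample_token(logits: list[float], seed: int, position: int) -> int:
    """Gumbel-max sampling: argmax of logits plus seeded Gumbel noise."""
    noise = gumbel_noise(seed, position, len(logits))
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def token_is_plausible(logits, claimed_token, seed, position, tolerance):
    """Accept the claimed token if its perturbed score is within `tolerance`
    of the best score, allowing for small numerical differences between runs."""
    noise = gumbel_noise(seed, position, len(logits))
    scores = [l + g for l, g in zip(logits, noise)]
    return max(scores) - scores[claimed_token] <= tolerance


def verify_response(recomputed_logits, claimed_tokens, seed, tolerance=0.05):
    """Return the positions whose claimed token the verifier cannot reproduce."""
    return [
        pos
        for pos, (logits, tok) in enumerate(zip(recomputed_logits, claimed_tokens))
        if not token_is_plausible(logits, tok, seed, pos, tolerance)
    ]


# Toy usage: three positions over a three-token vocabulary.
logits = [[0.0, 100.0, -100.0]] * 3
honest = [sample_token(l, seed=7, position=i) for i, l in enumerate(logits)]
print(verify_response(logits, honest, seed=7))      # honest output: nothing flagged
print(verify_response(logits, [1, 2, 1], seed=7))   # tampered token at position 1
```

The point of the design is that once the seed is fixed, an attacker's only remaining freedom to encode hidden bits is the set of tokens that fall inside the tolerance band, so a tighter band leaves less room for steganography but risks more false alarms on honest outputs.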
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Privacy-Preserving Technologies in Data · Explainable Artificial Intelligence (XAI)
