Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks
Michael Cunningham

TL;DR
This paper introduces a privacy-preserving split inference system for large language models over wide-area networks, utilizing asymmetric layer splits, lookahead decoding, and empirical privacy assessments to optimize performance and privacy.
Contribution
It presents the first application of lookahead decoding for split LLM inference over WANs, with empirical privacy evaluation and scalability to large models like NeMo 12B.
Findings
Achieves 8.7-9.3 tokens/sec on 7B models over WAN
Empirical inversion attack shows privacy tradeoff with split depth
System scales to 12B models with minimal local VRAM
Abstract
We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact;…
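The asymmetric split and the latency amortization described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `SplitPlan`, `plan_split`, and `tokens_per_second` are hypothetical names, the sketch assumes an "N-layer split" means the first N transformer blocks run locally, and the throughput formula is a toy model whose inputs are placeholders rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitPlan:
    """Which transformer blocks run where (hypothetical structure)."""
    local_blocks: range  # run on the trusted GPU, after the local embedding
    cloud_blocks: range  # run on the untrusted GPU, on activations only


def plan_split(n_blocks: int, split_depth: int) -> SplitPlan:
    """Keep the first `split_depth` blocks local and send the rest to the cloud.

    Assumption: embedding and unembedding stay local in every plan, so only
    intermediate activations cross the network.
    """
    if not 0 < split_depth < n_blocks:
        raise ValueError("split_depth must satisfy 0 < split_depth < n_blocks")
    return SplitPlan(range(0, split_depth), range(split_depth, n_blocks))


def tokens_per_second(rtt_s: float, compute_s: float, accepted_per_iter: float) -> float:
    """Toy model: one iteration costs a round trip plus compute time and
    yields `accepted_per_iter` tokens on average."""
    if rtt_s < 0 or compute_s < 0 or rtt_s + compute_s == 0:
        raise ValueError("iteration time must be positive")
    return accepted_per_iter / (rtt_s + compute_s)


if __name__ == "__main__":
    plan = plan_split(n_blocks=32, split_depth=8)
    print(len(plan.local_blocks), "local blocks,", len(plan.cloud_blocks), "cloud blocks")
    # Placeholder latencies: accepting more tokens per round trip raises throughput.
    print(tokens_per_second(rtt_s=0.08, compute_s=0.02, accepted_per_iter=1.0))
    print(tokens_per_second(rtt_s=0.08, compute_s=0.02, accepted_per_iter=3.0))
```

In this model a deeper split moves blocks from the cloud to the local device without adding round trips, which is one way to read the claim that split depth changes privacy with little throughput cost. Raising `accepted_per_iter` is the lever that lookahead decoding targets.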
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Privacy-Preserving Technologies in Data
