Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

Michael Cunningham

arXiv:2602.16760 · cs.CR · February 20, 2026

Open Access

TL;DR

This paper introduces a privacy-aware split inference system for large language models over wide-area networks. It combines an asymmetric layer split, lookahead decoding, and an empirical privacy evaluation to balance throughput against privacy.

Contribution

The paper presents what it describes as the first application of lookahead decoding to split LLM inference over WANs, together with an empirical privacy evaluation and scaling to larger models such as NeMo 12B.

Findings

01. Achieves 8.7-9.3 tokens/sec on 7B models over a WAN

02. Empirical inversion attack shows a privacy tradeoff with split depth

03. System scales to 12B models with minimal local VRAM

Abstract

We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff -- an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact;…
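The amortization argument in contribution (2) can be illustrated with a back-of-the-envelope latency model. This is a minimal sketch, not the paper's method or measurements: it assumes each decoding iteration costs exactly one network round trip plus some compute time, and that an iteration yields a fixed average number of accepted tokens. The function name `tokens_per_second` and every number below are hypothetical, chosen only to show the shape of the effect.

```python
def tokens_per_second(rtt_s: float, compute_s: float, accepted_per_iter: float) -> float:
    """Throughput under an assumed cost model: one round trip plus compute per iteration.

    rtt_s             -- network round-trip time per iteration, in seconds (hypothetical)
    compute_s         -- local plus remote compute time per iteration, in seconds (hypothetical)
    accepted_per_iter -- average tokens accepted per iteration (1.0 = plain autoregressive)
    """
    if rtt_s < 0 or compute_s < 0:
        raise ValueError("latencies must be non-negative")
    if accepted_per_iter <= 0:
        raise ValueError("accepted_per_iter must be positive")
    return accepted_per_iter / (rtt_s + compute_s)


# Hypothetical WAN link: 80 ms round trip.
# Plain decoding: one token per round trip, 20 ms of compute.
baseline = tokens_per_second(rtt_s=0.080, compute_s=0.020, accepted_per_iter=1.0)

# Multi-token iteration: slightly more compute, but three tokens share one round trip.
amortized = tokens_per_second(rtt_s=0.080, compute_s=0.030, accepted_per_iter=3.0)

print(f"baseline:  {baseline:.1f} tokens/sec")
print(f"amortized: {amortized:.1f} tokens/sec")
```

Under this model the round-trip time is a fixed cost per iteration, so throughput grows almost linearly with the number of tokens accepted per iteration as long as the extra compute stays small relative to the round trip. Real systems will deviate from this: acceptance rates vary per step, and transfer size for the activations adds to the per-iteration cost.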

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Privacy-Preserving Technologies in Data