Cascade: Token-Sharded Private LLM Inference
Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal

TL;DR
Cascade introduces a scalable, token-sharded inference protocol for large language models that balances privacy with performance, enabling practical secure third-party LLM deployment.
Contribution
The paper proposes a token-sharded inference scheme that scales better and runs faster than existing cryptographic privacy methods for large language models.
Findings
Cascade is resistant to advanced privacy attacks.
It significantly outperforms existing secure inference schemes in speed.
The method enables practical privacy-preserving inference for large LLMs.
Abstract
As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services, where LLMs are hosted by third parties with significant computational resources, are becoming increasingly popular. However, third-party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and…
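
As a rough illustration of what sharding in the sequence dimension can mean, the sketch below splits a token sequence across several nodes so that each node holds only a subset of positions. The round-robin assignment and the names `shard_positions`, `shard_tokens`, and `reassemble` are assumptions made for illustration; they are not taken from the paper, and Cascade's actual sharding scheme may differ.

```python
from typing import Dict, List, Sequence


def shard_positions(seq_len: int, num_nodes: int) -> Dict[int, List[int]]:
    """Assign token positions to nodes round-robin (hypothetical policy)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    shards: Dict[int, List[int]] = {node: [] for node in range(num_nodes)}
    for pos in range(seq_len):
        shards[pos % num_nodes].append(pos)
    return shards


def shard_tokens(tokens: Sequence[int], num_nodes: int) -> Dict[int, List[int]]:
    """Give each node only the tokens at its assigned positions."""
    positions = shard_positions(len(tokens), num_nodes)
    return {node: [tokens[p] for p in idxs] for node, idxs in positions.items()}


def reassemble(shards: Dict[int, List[int]], seq_len: int) -> List[int]:
    """Rebuild the full sequence; only a party holding every shard can do this."""
    num_nodes = len(shards)
    return [shards[pos % num_nodes][pos // num_nodes] for pos in range(seq_len)]


if __name__ == "__main__":
    tokens = [101, 7592, 2088, 2003, 2307, 102]  # made-up token IDs
    shards = shard_tokens(tokens, num_nodes=3)
    for node, shard in shards.items():
        print(f"node {node} sees {shard}")
    print("reassembled:", reassemble(shards, len(tokens)))
```

With more than one node, no single node holds the whole prompt in this toy setup. That is only the intuition behind the trade-off the abstract describes, not a privacy guarantee.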
