An Efficient Private GPT Never Autoregressively Decodes
Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo

TL;DR
This paper introduces a privacy-preserving GPT inference method that leverages public models for decoding and secure verification, significantly reducing latency while maintaining privacy and quality.
Contribution
It proposes a novel approach combining public decoding with secure verification, optimized private sampling, and model alignment to accelerate private GPT inference.
Findings
Achieves 2.1x to 6.0x speedup over standard secure decoding.
Maintains privacy and generation quality comparable to traditional methods.
Effective across multiple model pairs and network conditions.
Abstract
The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens incur similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using…
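The draft-then-verify loop described in the abstract can be illustrated with a plaintext toy. This is only a sketch under loud assumptions: there is no cryptography here, acceptance is a greedy exact-match rule rather than the paper's private sampling protocol, and the names `public_next`, `private_next`, `verify_draft`, and `generate` are hypothetical. What it shows is why the acceptance ratio matters: each verification round stands in for one secure pass, so the more draft tokens the private model accepts, the fewer rounds are needed.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def verify_draft(private_next: NextToken, prefix: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Check draft tokens against the private model (plaintext stand-in).

    Assumption: greedy exact-match acceptance. In a secure setting this whole
    check would run as a single pass over all draft positions at once.
    """
    accepted: List[Token] = []
    context = list(prefix)
    for tok in draft:
        target = private_next(context)
        if tok != target:
            # First mismatch: keep the private model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every draft token matched: the private model adds one more token.
    accepted.append(private_next(context))
    return accepted


def generate(public_next: NextToken, private_next: NextToken,
             prompt: List[Token], max_new_tokens: int,
             draft_len: int = 4) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # Public (cheap, local) decoding of a short draft.
        ctx = list(out)
        draft: List[Token] = []
        for _ in range(draft_len):
            tok = public_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # One verification round against the private model.
        out.extend(verify_draft(private_next, out, draft))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


# Toy models (hypothetical): the private model counts upward modulo 10,
# and the public model agrees with it except right after token 5.
def private_next(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def public_next(ctx: Sequence[Token]) -> Token:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


tokens, rounds = generate(public_next, private_next, [0], max_new_tokens=8)
```

In this toy run the output is identical to decoding with the private model alone, yet it takes 3 verification rounds instead of 8 single-token steps. A public model that agreed less often would push the round count back toward one per token.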
Taxonomy
Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data · Computability, Logic, AI Algorithms
Methods: Attention Is All You Need · Softmax · Cosine Annealing · Attention Dropout · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay · Dropout
