FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference

Chenqi Lin; Tianshi Xu; Zebin Yang; Runsheng Wang; Ru Huang; Meng Li

arXiv:2405.16241·cs.CR·May 28, 2024

TL;DR

FastQuery is a framework that reduces the computation and communication overhead of private LLM inference by optimizing embedding table queries through communication-aware quantization and one-hot-aware dense packing.

Contribution

FastQuery introduces a communication-aware quantization and one-hot-aware packing approach to optimize private embedding table queries, outperforming prior HE-based methods.

Findings

01 Achieves over 4.3x latency reduction compared to Cheetah.

02 Reduces communication by more than 75.7x on LLAMA-7B.

03 Demonstrates significant efficiency improvements on LLAMA-30B.

Abstract

With the fast evolution of large language models (LLMs), privacy concerns arise because user queries may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as an HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe that the overhead mainly comes from neglecting 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication…
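The two observations in the abstract can be illustrated in plaintext. The sketch below is not the FastQuery algorithm and performs no encryption. It only shows that multiplying an embedding table by a one-hot query selects a single row, and that the lookup still works when the table is stored as low bit-width integers. The toy table, the helper names (`one_hot`, `query_by_matvec`, `quantize`), and the symmetric uniform quantizer are all assumptions chosen for illustration.

```python
# Plaintext illustration only: no homomorphic encryption is performed.
# All function names and the toy table are hypothetical.

def one_hot(index, vocab_size):
    """Return a length-vocab_size vector with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(vocab_size)]

def query_by_matvec(table, onehot):
    """Compute table^T @ onehot, the matrix-vector form of an embedding query."""
    dim = len(table[0])
    return [sum(onehot[r] * table[r][c] for r in range(len(table)))
            for c in range(dim)]

def quantize(table, bits):
    """Symmetric uniform quantization of every entry to signed `bits`-bit integers."""
    max_abs = max(abs(v) for row in table for v in row)
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax if max_abs else 1.0
    q = [[round(v / scale) for v in row] for row in table]
    return q, scale

# Toy embedding table: 4 tokens, 3 dimensions.
table = [[0.5, -1.0, 0.25],
         [1.5, 0.0, -0.75],
         [-0.25, 2.0, 1.0],
         [0.0, 0.5, -2.0]]

token_id = 2
# Observation 1: a one-hot query turns the matrix-vector product into a row lookup.
result = query_by_matvec(table, one_hot(token_id, len(table)))
assert result == table[token_id]

# Observation 2: the same query on a 4-bit table returns a close approximation.
q_table, scale = quantize(table, bits=4)
q_result = query_by_matvec(q_table, one_hot(token_id, len(q_table)))
dequantized = [v * scale for v in q_result]
```

In this toy setting every dequantized entry stays within about half a quantization step of the original value. Whether that error is acceptable for a real model is an empirical question that the paper addresses and this sketch does not.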

Taxonomy

Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data · Library Science and Information Systems