MPC-Minimized Secure LLM Inference

Deevashwer Rathee; Dacheng Li; Ion Stoica; Hao Zhang; Raluca Popa

arXiv:2408.03561·cs.CR·August 8, 2024

Open Access · 3 Reviews

TL;DR

Marill is a novel fine-tuning framework that reduces the computational and communication overhead of secure LLM inference using MPC, making privacy-preserving inference more practical without significant performance loss.

Contribution

It introduces architectural modifications during fine-tuning that minimize MPC operations, significantly improving efficiency of secure LLM inference.

Findings

01 · 3.6-11.3x faster runtime during secure inference

02 · 2.4-6.9x reduced communication overhead

03 · Over 90% task performance preserved

Abstract

Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC); however, it is still impractical for modern LLM workloads due to the large overhead imposed by MPC. To address this overhead, we propose Marill, a framework that adapts LLM fine-tuning to minimize MPC usage during secure inference. Marill introduces high-level architectural changes during fine-tuning that significantly reduce the number of expensive operations needed within MPC during inference, by removing some and relocating others outside MPC without compromising security. As a result, Marill-generated models are more efficient across all secure inference protocols and our approach complements MPC-friendly approximations…
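One way to picture "relocating operations outside MPC" is layer freezing, which the reviews below mention together with a frozen fraction f. The following is a minimal sketch, not the paper's implementation: the function name `split_layers` and the assumption that the first f fraction of layers keep public base-model weights (so they can be evaluated in plaintext) while the remaining layers run inside MPC are illustrative choices.

```python
def split_layers(num_layers: int, frozen_fraction: float) -> tuple[range, range]:
    """Return (public, private) layer index ranges.

    Assumption (hypothetical helper): the first `frozen_fraction` of layers
    keep the public base-model weights and can be evaluated outside MPC;
    the remaining layers hold fine-tuned, private weights and run inside MPC.
    """
    if not 0.0 <= frozen_fraction <= 1.0:
        raise ValueError("frozen_fraction must be in [0, 1]")
    num_frozen = int(num_layers * frozen_fraction)
    return range(0, num_frozen), range(num_frozen, num_layers)


# Example: a 32-layer model with half of the layers frozen.
public, private = split_layers(num_layers=32, frozen_fraction=0.5)
print(len(public), len(private))  # 16 16
```

Under this assumption, the number of layers evaluated inside MPC shrinks linearly with f, which is why the choice of f trades accuracy against runtime.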

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The ideas considered are sound, and the authors provide an implementation for reproducibility (which, disclaimer, I didn't verify myself). The idea of head-merging is, to the best of my knowledge, novel. It is relevant as it seems to work well in terms of accuracy, but it boosts the efficiency of MPC noticeably. The experiments are comprehensive and thorough. They are well described and enable good reproducibility. Overall the paper is well written.
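The page does not spell out how head-merging works, so the following is only a rough cost sketch under a stated assumption: groups of `merge_factor` attention heads are merged into one wider head, so fewer attention maps pass through softmax. The function name and the cost model (one softmax input per entry of each seq_len x seq_len attention map) are hypothetical.

```python
def attention_softmax_inputs(num_heads: int, seq_len: int, merge_factor: int = 1) -> int:
    """Count the softmax inputs in one attention layer.

    Assumption: merging `merge_factor` heads into one leaves
    num_heads / merge_factor attention maps, each seq_len x seq_len.
    """
    if merge_factor < 1 or num_heads % merge_factor != 0:
        raise ValueError("merge_factor must be a positive divisor of num_heads")
    merged_heads = num_heads // merge_factor
    return merged_heads * seq_len * seq_len


base = attention_softmax_inputs(num_heads=32, seq_len=128)
merged = attention_softmax_inputs(num_heads=32, seq_len=128, merge_factor=4)
print(base // merged)  # 4
```

Since non-linear functions such as softmax are typically among the costlier operations in MPC, a reduction of this kind would translate directly into less secure computation, provided accuracy holds up after fine-tuning.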

Weaknesses

I am not sure the problem is very well motivated. This is not about running inference obliviously, which is very well motivated. This is what most (if not all) of the cited prior works consider: a model owner wants to keep their model private and a client wants to keep the query hidden, while still learning the inference result. This work starts from the premise that the model that the client may want to query is actually a fine-tuned version of an open-source model, the data for fine-tu

Reviewer 02 · Rating 5 · Confidence 5

Strengths

The paper describes three techniques to minimize the use of MPC: Layer Freezing, Low-Rank Adaptation (LoRA), and Head Merging. These techniques reduce the cost of MPC in secure inference.

Weaknesses

1. The paper's innovation is limited. The proposed method does not include improvements to the MPC protocol itself but focuses on different partitioning and head merging methods in the fine-tuning network structure. Regarding the Layer Freezing method in the partitioning approach, is there any reference basis for selecting how many layers to freeze for different LLMs to achieve a balance between accuracy and timeliness? 2. Insufficient comparison of accuracy in experimental results. The paper doe

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. This work innovatively brings layer freezing and LoRA techniques from plaintext inference to secure inference. In particular, it reduces the bottleneck matmul dimensions in MPC-based inference. 2. This work describes the threat models in secure inference of LLMs well and explains why each component can be optimized while accounting for potential attacks. It also explains why some parts, such as the mixture of weights, cannot benefit from optimization due to the nature of MPC.

Weaknesses

1. In Section 5.1, the authors do not explain the criteria or threshold for the frozen fraction f. 2. Since only 2PC-Dealer supports GPU acceleration, the comparison with the other, CPU-only MPC approaches looks unfair due to the hardware difference. For an apples-to-apples comparison, it would be better to run 2PC-Dealer on CPU only as well.

Taxonomy

Topics: Advanced Data Storage Technologies · Interconnection Networks and Systems · Distributed and Parallel Computing Systems