PrivateLoRA For Efficient Privacy Preserving LLM
Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang

TL;DR
PrivateLoRA introduces a privacy-preserving, communication-efficient framework for distributed LLM inference that maintains data locality and achieves high throughput on edge devices, enabling democratized access to advanced AI models.
Contribution
The paper proposes PrivateLoRA, a low-rank residual activation technique that cuts the communication overhead of distributed LLM inference by over 95% while keeping privacy-sensitive computation on the edge device.
Findings
Over 95% reduction in communication overhead.
Achieves 300% of the throughput of device-only solutions for 7B models.
Provides comparable tuning performance to LoRA for personalization.
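The communication figure above can be sanity-checked with a back-of-the-envelope calculation. The sketch below compares the per-token payload of sending a full hidden-state vector per layer with sending only a low-rank activation. All numbers are illustrative assumptions rather than values taken from the paper: a hidden size of 4096 and 32 layers (typical of a LLaMA-style 7B model), fp16 values, and a hypothetical rank of 128. The paper's exact accounting, such as how many projections per layer are transmitted, may differ.

```python
# Assumed, illustrative settings (not taken from the paper).
HIDDEN_DIM = 4096      # hidden size typical of a 7B LLaMA-style model
NUM_LAYERS = 32        # decoder layers typical of a 7B LLaMA-style model
RANK = 128             # hypothetical low-rank bottleneck width
BYTES_PER_VALUE = 2    # fp16


def bytes_per_token(width: int, num_layers: int = NUM_LAYERS,
                    bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes sent in one direction per token when one vector of
    `width` values is transmitted for every layer."""
    return width * num_layers * bytes_per_value


def communication_reduction(hidden_dim: int, rank: int) -> float:
    """Fraction of traffic saved by sending rank-sized activations
    instead of full hidden states."""
    return 1.0 - rank / hidden_dim


full_payload = bytes_per_token(HIDDEN_DIM)   # full hidden states
low_rank_payload = bytes_per_token(RANK)     # low-rank activations
saved = communication_reduction(HIDDEN_DIM, RANK)

print(f"full: {full_payload} B/token, low-rank: {low_rank_payload} B/token")
print(f"reduction: {saved:.1%}")
```

Under these assumptions, any rank below roughly 5% of the hidden size (about 204 for a hidden size of 4096) yields a reduction above 95%, which is consistent with the reported figure.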
Abstract
End users face a choice between privacy and efficiency in current Large Language Model (LLM) service paradigms. In cloud-based paradigms, users are forced to compromise data locality for generation quality and processing speed. Conversely, edge device paradigms maintain data locality but fail to deliver satisfactory performance. In this work, we propose a novel LLM service paradigm that distributes privacy-sensitive computation on edge devices and shared computation in the cloud. Only activations are transmitted between the central cloud and edge devices to ensure data locality. Our core innovation, PrivateLoRA, addresses the challenging communication overhead by exploiting the low rank of residual activations, achieving over 95% communication reduction. Consequently, PrivateLoRA effectively maintains data locality and is extremely resource efficient. Under standard 5G networks,…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Topic Modeling
