Distributed On-Device LLM Inference With Over-the-Air Computation
Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, and Khaled B. Letaief

TL;DR
This paper introduces a distributed on-device LLM inference framework that uses tensor parallelism and over-the-air computation to reduce latency and communication overhead on edge devices.
Contribution
It proposes a novel over-the-air computation method combined with joint model and transceiver optimization for efficient distributed LLM inference.
Findings
Significantly reduces inference latency.
Improves inference accuracy.
Enables practical deployment of LLMs on resource-constrained devices.
Abstract
Large language models (LLMs) have achieved remarkable success across various artificial intelligence tasks. However, their enormous sizes and computational demands pose significant challenges for deployment on edge devices. To address this issue, we present a distributed on-device LLM inference framework based on tensor parallelism, which partitions neural network tensors (e.g., weight matrices) of LLMs among multiple edge devices for collaborative inference. Nevertheless, tensor parallelism involves frequent all-reduce operations to aggregate intermediate layer outputs across participating devices during inference, resulting in substantial communication overhead. To mitigate this bottleneck, we propose an over-the-air computation method that leverages the analog superposition property of wireless multiple-access channels to facilitate fast all-reduce operations. To minimize the…
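The two ideas in the abstract, tensor parallelism and all-reduce by analog superposition, can be illustrated with a small NumPy sketch. This is a toy model under strong assumptions and is not the authors' method: the channel is idealized as a real-valued sum of the transmitted signals plus Gaussian noise (unit gains, no fading, no power control, no transceiver optimization), and all function names and parameters (`local_partial_outputs`, `digital_all_reduce`, `over_the_air_all_reduce`, `noise_std`) are hypothetical.

```python
import numpy as np

def local_partial_outputs(x, W, n_devices):
    """Row-wise tensor parallelism: each device holds a slice of the input
    and the matching rows of W, and computes a partial output."""
    x_shards = np.array_split(x, n_devices)
    W_shards = np.array_split(W, n_devices, axis=0)
    return [xs @ Ws for xs, Ws in zip(x_shards, W_shards)]

def digital_all_reduce(partials):
    """Reference all-reduce: an exact sum of the partial outputs."""
    return np.sum(partials, axis=0)

def over_the_air_all_reduce(partials, noise_std, rng):
    """Idealized over-the-air sum (assumption): simultaneous transmissions
    add up in the channel, and the receiver sees the sum plus noise."""
    superposed = np.sum(partials, axis=0)
    noise = rng.normal(0.0, noise_std, size=superposed.shape)
    return superposed + noise

rng = np.random.default_rng(0)
x = rng.standard_normal(8)        # layer input
W = rng.standard_normal((8, 4))   # layer weight matrix

partials = local_partial_outputs(x, W, n_devices=4)
exact = digital_all_reduce(partials)
noisy = over_the_air_all_reduce(partials, noise_std=0.01, rng=rng)

# Aggregation error introduced by the noisy analog sum.
mse = float(np.mean((noisy - exact) ** 2))
```

In this sketch the exact sum reproduces the unpartitioned product `x @ W`, while the over-the-air version trades a small aggregation error for receiving all partial outputs in a single simultaneous transmission.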
Taxonomy
Topics: Real-time simulation and control systems · Medical Imaging Techniques and Applications · Image and Signal Denoising Methods
