PRISM: Distributed Inference for Foundation Models at Edge
Muhammad Azlan Qazi, Alexandros Iosifidis, Qi Zhang

TL;DR
PRISM introduces a communication-efficient, compute-aware distributed inference strategy for foundation models on edge devices, significantly reducing data transfer and computation with minimal accuracy loss.
Contribution
The paper proposes a Segment Means approximation of intermediate features and a restructured self-attention mechanism for Transformer inference, enabling scalable deployment of foundation models at the edge.
Findings
Up to 99.2% reduction in communication overhead for BERT
51.24% reduction in per-device computation for BERT
Minor accuracy degradation across evaluated models
Abstract
Foundation models (FMs) have achieved remarkable success across a wide range of applications, from image classification to natural language processing, but pose significant challenges for deployment at the edge. This has sparked growing interest in developing practical and efficient strategies for bringing foundation models to edge environments. In this work, we propose PRISM, a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. Our method leverages a Segment Means representation to approximate intermediate output features, drastically reducing inter-device communication. Additionally, we restructure the self-attention mechanism to eliminate redundant computations caused by per-device Key/Value calculation in position-wise partitioning, and we design a partition-aware causal masking scheme tailored for autoregressive models. We evaluate…
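To make the Segment Means idea concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how segments are formed, so the code assumes contiguous, roughly equal-length token segments on each device. The function name segment_means, the tensor sizes, and the segment count are all hypothetical and are not taken from the paper.

```python
import numpy as np

def segment_means(features: np.ndarray, num_segments: int) -> np.ndarray:
    """Average a (tokens, dim) feature matrix over contiguous token segments.

    Assumption: segments are contiguous and as equal in length as possible.
    """
    # array_split tolerates token counts that are not divisible by num_segments
    segments = np.array_split(features, num_segments, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])

# Hypothetical sizes: one device holds 128 token features of width 768
rng = np.random.default_rng(0)
local_features = rng.standard_normal((128, 768))

# Only this compact summary would be sent to peer devices
summary = segment_means(local_features, num_segments=4)

# Fraction of values that no longer need to be transmitted
saved = 1.0 - summary.size / local_features.size
print(summary.shape, saved)
```

In this sketch the number of transmitted values shrinks by the ratio of segments to tokens, so fewer segments mean less communication but a coarser approximation of the original features.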
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Geological Modeling and Analysis
