PRISM: Distributed Inference for Foundation Models at Edge

Muhammad Azlan Qazi; Alexandros Iosifidis; Qi Zhang

arXiv:2507.12145·cs.LG·July 17, 2025

TL;DR

PRISM introduces a communication-efficient, compute-aware distributed inference strategy for foundation models on edge devices, significantly reducing data transfer and computation with minimal accuracy loss.

Contribution

The paper proposes novel approximation and restructuring techniques for Transformer inference, enabling scalable deployment of foundation models at the edge.

Findings

1. Up to 99.2% reduction in communication overhead for BERT

2. 51.24% reduction in per-device computation for BERT

3. Minor accuracy degradation across evaluated models

Abstract

Foundation models (FMs) have achieved remarkable success across a wide range of applications, from image classification to natural language processing, but pose significant challenges for deployment at the edge. This has sparked growing interest in developing practical and efficient strategies for bringing foundation models to edge environments. In this work, we propose PRISM, a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices. Our method leverages a Segment Means representation to approximate intermediate output features, drastically reducing inter-device communication. Additionally, we restructure the self-attention mechanism to eliminate redundant computations caused by per-device Key/Value calculation in position-wise partitioning, and we design a partition-aware causal masking scheme tailored for autoregressive models. We evaluate…
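To make the Segment Means idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `segment_means` and `communication_reduction`, the near-equal contiguous segmentation, and the "vectors sent" cost model are all assumptions made for illustration. The sketch assumes each device compresses its local block of token features into a small number of averaged vectors before sending them to its peers.

```python
from typing import List

Vector = List[float]


def segment_means(features: List[Vector], num_segments: int) -> List[Vector]:
    """Average contiguous groups of token vectors into num_segments vectors.

    Hypothetical helper: the segmentation rule (near-equal contiguous
    chunks) is an assumption, not taken from the paper.
    """
    n = len(features)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and the number of tokens")
    dim = len(features[0])
    means: List[Vector] = []
    for s in range(num_segments):
        # Integer arithmetic gives contiguous, non-empty, near-equal segments.
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        segment = features[start:end]
        means.append([sum(tok[d] for tok in segment) / len(segment) for d in range(dim)])
    return means


def communication_reduction(num_tokens: int, num_segments: int) -> float:
    """Fraction of feature vectors saved when a device sends segment means
    instead of all of its token vectors (assumed cost model)."""
    return 1.0 - num_segments / num_tokens


# Usage: a device holding 4 token vectors sends 2 segment means instead.
local_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
compressed = segment_means(local_features, num_segments=2)
saved = communication_reduction(num_tokens=4, num_segments=2)
```

Under this assumed cost model, the saving depends only on the ratio of segments to tokens, so fewer segments mean less traffic but a coarser approximation of the remote features.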

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Geological Modeling and Analysis