HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
Peirong Zheng, Wenchao Xu, Haozhao Wang, Jinyu Chen, Xuemin Shen

TL;DR
HALO is a framework that speeds up distributed large language model inference at the edge through semantic-aware synchronization and load balancing, reporting a 3.41x end-to-end speedup on a Raspberry Pi cluster under lossy network conditions.
Contribution
HALO introduces a semantic-aware predictor, parallel neuron loading, and load balancing to improve distributed LLM inference in unreliable edge networks, reducing synchronization delays.
Findings
3.41x end-to-end speedup on a Raspberry Pi cluster
Maintains performance comparable to ideal conditions
Outperforms existing methods in lossy network scenarios
Abstract
Deploying large language model (LLM) inference at the edge can enable prompt service responses while protecting user privacy. However, it is severely constrained by the limited resources of a single edge node. Distributed inference has emerged as a way to aggregate and leverage computational resources across multiple devices. Yet existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor to assess the…
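The allocation idea in the abstract (less critical neuron groups go to less stable devices) can be illustrated with a minimal sketch. This is not HALO's actual algorithm: the function name `assign_groups`, the per-group `importance` scores, and the per-device `reliability` scores are all hypothetical stand-ins, and the equal-size split is an assumed simplification of load balancing.

```python
def assign_groups(importance, reliability):
    """Map each neuron-group index to a device index.

    importance:  hypothetical criticality score per neuron group
    reliability: hypothetical link-stability score per device
    Illustrative only; not the allocation rule used by HALO.
    """
    n_groups, n_devices = len(importance), len(reliability)
    # Rank groups from most to least critical, devices from most to least stable.
    groups = sorted(range(n_groups), key=lambda g: importance[g], reverse=True)
    devices = sorted(range(n_devices), key=lambda d: reliability[d], reverse=True)

    # Give every device a near-equal share of groups (assumed load balancing).
    base, extra = divmod(n_groups, n_devices)
    assignment = {}
    start = 0
    for rank, dev in enumerate(devices):
        size = base + (1 if rank < extra else 0)
        for g in groups[start:start + size]:
            assignment[g] = dev
        start += size
    return assignment


# Six neuron groups, three devices (device 1 has the most stable link).
importance = [0.9, 0.1, 0.5, 0.7, 0.2, 0.3]
reliability = [0.6, 0.99, 0.8]
plan = assign_groups(importance, reliability)
```

In this toy run the two most critical groups (0 and 3) land on the most stable device, while the two least critical groups (1 and 4) land on the least stable one, so a delayed packet from that device affects only low-importance neurons.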
Taxonomy
Topics: IoT and Edge/Fog Computing · Privacy-Preserving Technologies in Data · IoT Networks and Protocols
