Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization

Jiu Chen; Shuangyan Yang; Xu Xiong; Hexiao Duan; Xinran Zhang; Jie Ren; Dong Li

arXiv:2604.21072·cs.DC·May 6, 2026

TL;DR

BloomBee is a decentralized LLM inference framework that jointly optimizes communication and computation across internet-scale networks, improving service throughput by up to 1.76x and reducing average latency by up to 43.20%.

Contribution

BloomBee introduces a multi-dimensional communication optimization approach for decentralized LLM inference: it coordinates LLM-layer assignment, micro-batching, and tensor offloading as a single optimization problem and solves that problem with dynamic programming.

Findings

01 Improves service throughput by up to 1.76x.

02 Reduces average latency by up to 43.20%.

03 Effectively adapts to various low-bandwidth network environments.

Abstract

Decentralized LLM inference distributes computation among heterogeneous nodes across the internet, offering a performant and cost-efficient alternative to traditional centralized inference. However, low cross-node network bandwidth makes communication the primary bottleneck. In this paper, we introduce BloomBee, an internet-scale distributed LLM inference framework. BloomBee integrates LLM-layer assignment, micro-batching, and tensor offloading to optimize communication along multiple dimensions. Additionally, BloomBee formulates the coordination of these techniques as an optimization problem and solves it using dynamic programming. BloomBee also customizes lossless compression and speculative decoding for low-bandwidth network settings to reduce communication overhead. We evaluate BloomBee across a spectrum of network environments and show that it improves service…
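
To make the dynamic-programming idea in the abstract more concrete, here is a minimal sketch of one related and much simpler problem: splitting a model's layers into contiguous stages across an ordered chain of heterogeneous nodes so that the slowest stage (compute time plus the cost of sending activations to the next node) is as fast as possible. This is not BloomBee's actual formulation, which also covers micro-batching and tensor offloading. The function name `assign_layers`, the parameters `layer_cost`, `node_speed`, and `link_cost`, and the cost model itself are all assumptions made for illustration.

```python
def assign_layers(layer_cost, node_speed, link_cost):
    """Hypothetical sketch: contiguous layer-to-node assignment via DP.

    layer_cost[i] : assumed compute cost of layer i (arbitrary units)
    node_speed[k] : assumed relative speed of node k (higher is faster)
    link_cost[k]  : assumed cost of sending activations from node k to k+1
    Every node receives at least one layer, in node order.
    Returns (bottleneck_time, [(start, end), ...]) with end exclusive.
    """
    n_layers, n_nodes = len(layer_cost), len(node_speed)
    if n_layers < n_nodes:
        raise ValueError("need at least one layer per node")
    if len(link_cost) != n_nodes - 1:
        raise ValueError("link_cost must have one entry per adjacent node pair")

    # Prefix sums give the cost of any contiguous layer range in O(1).
    prefix = [0.0]
    for c in layer_cost:
        prefix.append(prefix[-1] + c)

    def stage_time(k, i, j):
        # Time for node k to run layers [i, j) and forward the result.
        compute = (prefix[j] - prefix[i]) / node_speed[k]
        transfer = link_cost[k] if k < n_nodes - 1 else 0.0
        return compute + transfer

    inf = float("inf")
    # best[k][j]: smallest bottleneck when the first k nodes host layers [0, j).
    best = [[inf] * (n_layers + 1) for _ in range(n_nodes + 1)]
    cut = [[0] * (n_layers + 1) for _ in range(n_nodes + 1)]
    best[0][0] = 0.0

    for k in range(1, n_nodes + 1):
        for j in range(k, n_layers + 1):
            for i in range(k - 1, j):  # node k-1 hosts layers [i, j)
                cand = max(best[k - 1][i], stage_time(k - 1, i, j))
                if cand < best[k][j]:
                    best[k][j] = cand
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the stages.
    stages, j = [], n_layers
    for k in range(n_nodes, 0, -1):
        i = cut[k][j]
        stages.append((i, j))
        j = i
    stages.reverse()
    return best[n_nodes][n_layers], stages


if __name__ == "__main__":
    # Four equal layers, a slow node followed by a 3x faster node.
    print(assign_layers([1, 1, 1, 1], [1, 3], [0.5]))
```

In this toy model the faster second node ends up hosting three of the four layers, because shifting work toward it lowers the bottleneck even after the transfer cost is counted. A real internet-scale setting would need measured bandwidths and per-layer costs rather than the fixed numbers assumed here.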
