Recursive Offloading for LLM Serving in Multi-tier Networks

Zhiyuan Wu; Sheng Sun; Yuwei Wang; Min Liu; Bo Gao; Jinda Lu; Zheming Yang; Tian Wen

arXiv:2505.16502·cs.DC·May 27, 2025

TL;DR

RecServe is a recursive offloading framework for LLM inference in multi-tier networks. It adjusts offloading decisions dynamically based on task complexity and confidence, cutting communication overhead by over 50% relative to centralized cloud serving while improving service quality.

Contribution

The paper introduces RecServe, a novel recursive offloading framework with hierarchical confidence evaluation and dynamic thresholding for efficient LLM serving across device, edge, and cloud tiers.
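
To make the idea of recursive offloading with confidence evaluation and dynamic thresholding concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Tier` class, the `serve` function, the `offload_rate` parameter, and the quantile-over-recent-history threshold rule are all hypothetical choices made for illustration, and each model is assumed to be a callable that returns an answer together with a confidence score in [0, 1].

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

# Hypothetical model interface: prompt -> (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    """One serving tier (e.g. device, edge, or cloud). Illustrative only."""
    name: str
    model: Model
    # Assumption: roughly this fraction of recent requests should be offloaded.
    offload_rate: float = 0.5
    # Sliding window of recent confidence scores seen at this tier.
    history: Deque[float] = field(default_factory=lambda: deque(maxlen=100))

    def threshold(self) -> float:
        """Dynamic threshold: a quantile of the recent confidence scores."""
        if not self.history:
            return 0.0  # no history yet, so accept the local answer
        ranked = sorted(self.history)
        idx = min(int(self.offload_rate * len(ranked)), len(ranked) - 1)
        return ranked[idx]


def serve(tiers: List[Tier], prompt: str, level: int = 0) -> Tuple[str, str]:
    """Answer at the current tier, or recursively offload to the next one."""
    tier = tiers[level]
    answer, confidence = tier.model(prompt)
    if level == len(tiers) - 1:
        return tier.name, answer  # top tier always answers
    cutoff = tier.threshold()  # computed before recording this request
    tier.history.append(confidence)
    if confidence >= cutoff:
        return tier.name, answer
    return serve(tiers, prompt, level + 1)
```

In this sketch a request leaves a tier only when that tier's confidence falls below its own moving cutoff, so prompts the lower tiers handle confidently never have to be transmitted upward.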

Findings

01. RecServe reduces communication overhead by over 50% compared to centralized cloud serving.

02. RecServe outperforms existing methods like CasServe in service quality.

03. Experimental results on eight datasets validate RecServe's effectiveness.

Abstract

Heterogeneous device-edge-cloud computing infrastructures have become widely adopted in telecommunication operators and Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures becomes a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained,…

Code & Models

Repositories

wuzhiyuan2000/recserve
PyTorch · Official

Taxonomy

Topics: Advanced MIMO Systems Optimization · IoT and Edge/Fog Computing · IoT Networks and Protocols