Argus: Token Aware Distributed LLM Inference Optimization
Panlong Wu, Yifei Zhong, Danyang Chen, Ting Wang, Fangxin Wang

TL;DR
Argus is a token-aware distributed inference framework for large language models that optimizes task offloading across edge-cloud systems by accurately predicting output token lengths and dynamically managing resources.
Contribution
This work introduces Argus, the first token-aware distributed LLM inference system with a novel length prediction module and a Lyapunov-guided optimization for efficient offloading in heterogeneous environments.
Findings
Argus significantly reduces inference latency in dynamic edge-cloud settings.
The Length-Aware Semantics (LAS) module accurately predicts output token lengths, improving offloading decisions.
Extensive evaluations show Argus outperforms existing methods in efficiency and robustness.
Abstract
Large Language Models (LLMs) are rapidly being integrated into real-world applications, yet their autoregressive architectures introduce significant inference time variability, especially when deployed across heterogeneous edge-cloud systems. Existing solutions largely neglect the dynamic, stochastic, and heterogeneous nature of such environments, often ignoring the impact of variable output token lengths and device diversity. In this work, we present Argus, the first token-aware distributed edge-cloud LLM inference framework that conducts efficient task offloading. Argus features a Length-Aware Semantics (LAS) module, which predicts output token lengths for incoming prompts using a fine-tuned language model with token-length-sensitive feature modulation, enabling precise estimation. Building on this, our Lyapunov-guided Offloading Optimization (LOO) module formulates long-term…
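To make the idea of token-aware offloading concrete, here is a minimal sketch of how a predicted output length could steer a request to an edge or cloud device. It is not the Argus LAS or LOO implementation: the `Device` fields, the linear latency model, the greedy selection rule, and all numeric values are assumptions chosen for illustration, and the predicted token count is simply passed in rather than produced by a fine-tuned model.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device profile; every field is an illustrative assumption."""
    name: str
    prefill_s: float         # time to process the prompt, in seconds
    per_token_s: float       # decode time per output token, in seconds
    network_s: float         # transfer time to reach the device, in seconds
    backlog_tokens: int = 0  # predicted output tokens already queued here


def estimated_latency(dev: Device, predicted_tokens: int) -> float:
    """Assumed linear model: transfer + prefill + decode of backlog and new request."""
    decode = dev.per_token_s * (dev.backlog_tokens + predicted_tokens)
    return dev.network_s + dev.prefill_s + decode


def offload(devices: list[Device], predicted_tokens: int) -> Device:
    """Greedy rule (not a Lyapunov optimization): pick the lowest estimated latency."""
    best = min(devices, key=lambda d: estimated_latency(d, predicted_tokens))
    best.backlog_tokens += predicted_tokens  # the chosen device's queue grows
    return best


# Made-up profiles: a slow edge device with no network cost, and a fast but distant cloud.
edge = Device("edge", prefill_s=0.05, per_token_s=0.04, network_s=0.0)
cloud = Device("cloud", prefill_s=0.02, per_token_s=0.01, network_s=0.5)

short_choice = offload([edge, cloud], predicted_tokens=10)
long_choice = offload([edge, cloud], predicted_tokens=200)
print(short_choice.name, long_choice.name)
```

Under these assumed numbers, the request predicted to be short stays on the edge device, while the request predicted to be long goes to the cloud because decode time dominates the fixed network cost. A length-blind policy would not be able to tell the two requests apart.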
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications
