Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

TL;DR
This paper presents an autoregressive-aware split computing framework for deploying large language models on resource-constrained devices, combining novel quantization and optimization techniques to reduce memory, latency, and communication costs.
Contribution
The paper introduces a new split computing framework with adaptive quantization and optimization strategies designed specifically for autoregressive LLM inference on edge devices.
Findings
Achieves 1.49x inference speedup over state-of-the-art methods.
Reduces communication overhead significantly while maintaining accuracy.
Demonstrates effectiveness across diverse LLMs and hardware platforms.
Abstract
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by…
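As a rough illustration of the "expanding key-value (KV) cache requirements" the abstract mentions, the sketch below estimates how cache memory grows with the number of generated tokens in a standard decoder-only transformer. It is not taken from the paper: the function name `kv_cache_bytes` and the model dimensions used in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Each layer stores one key tensor and one value tensor, each holding
    num_kv_heads * head_dim values per token, hence the leading factor of 2.
    All parameter names here are illustrative, not from the paper.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model configuration (an assumption, not from the paper).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_2048 = kv_cache_bytes(32, 32, 128, seq_len=2048)

print(per_token / 2**20, "MiB per token")        # 0.5 MiB per token
print(at_2048 / 2**30, "GiB at 2048 tokens")     # 1.0 GiB at 2048 tokens
```

Under these assumed dimensions the cache grows linearly by about half a mebibyte per generated token, which suggests why long generations can exhaust memory on a small device even when the model weights themselves fit.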
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Multimodal Machine Learning Applications
