HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices
Shen Xu, Xiangwen Zhuge, Zhe Xu, Yingkun Hu, Zheng Yang, Yunhao Liu

TL;DR
HCInfer is an inference system that offloads error compensation to the CPU while the compressed model runs on the GPU, so that large language models can be deployed on resource-limited devices with better accuracy and speed.
Contribution
HCInfer introduces a heterogeneous CPU-GPU inference framework with an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation, improving accuracy and efficiency on constrained hardware.
Findings
Achieves up to 5.2% accuracy improvement on downstream tasks.
Provides up to 10.4x speedup over full-precision models.
Offloads residual compensation to the CPU, leaving the GPU to execute the compressed backbone.
Abstract
LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial parameter storage but access only a small subset of compensation parameters during each inference step. Motivated by this opportunity, we propose HCInfer, a heterogeneous inference system that offloads residual compensation to the CPU while executing the compressed backbone on the GPU, and further introduces an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to hide…
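The idea of a LoRA-style compensation branch running beside a compressed backbone can be illustrated with a small NumPy sketch. This is not HCInfer's implementation: the 4-bit uniform quantizer, the truncated-SVD fit of the residual, the fixed rank of 8, the matrix sizes, and the helper names `quantize_uniform` and `low_rank_compensation` are all assumptions chosen for illustration. A worker thread stands in for the CPU-side branch, and the main thread stands in for the GPU backbone.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def quantize_uniform(w, bits=4):
    """Symmetric per-tensor uniform quantization (assumed stand-in compressor)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.clip(np.round(w / scale), -levels, levels) * scale


def low_rank_compensation(w, w_q, rank):
    """Fit the residual W - W_q with a rank-`rank` product A @ B via truncated SVD."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    b = vt[:rank, :]            # shape (rank, in_features)
    return a, b


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))  # hypothetical full-precision weight
x = rng.standard_normal(128)        # hypothetical input activation

w_q = quantize_uniform(w, bits=4)
a, b = low_rank_compensation(w, w_q, rank=8)

# Run the compensation branch concurrently with the backbone matmul.
with ThreadPoolExecutor(max_workers=1) as pool:
    comp_future = pool.submit(lambda: a @ (b @ x))  # stands in for the CPU branch
    y_backbone = w_q @ x                            # stands in for the GPU backbone
    y = y_backbone + comp_future.result()           # merge once both are ready

y_ref = w @ x
err_plain = np.linalg.norm(y_ref - y_backbone)
err_comp = np.linalg.norm(y_ref - y)
print(f"error without compensation: {err_plain:.4f}")
print(f"error with compensation:    {err_comp:.4f}")
```

Because the truncated SVD removes the largest singular components of the residual, the compensated output is never further from the full-precision output than the uncompensated one. How much it helps depends on the rank, which is the quantity a rank allocation policy would tune per layer.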
