HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices

Shen Xu; Xiangwen Zhuge; Zhe Xu; Yingkun Hu; Zheng Yang; Yunhao Liu

arXiv:2605.05819·cs.LG·May 8, 2026

TL;DR

HCInfer is an inference system that offloads error compensation to the CPU while the compressed model runs on the GPU, enabling large language models to be deployed on resource-limited devices with improved accuracy and speed.

Contribution

HCInfer introduces a heterogeneous inference framework with an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to improve accuracy and efficiency on constrained hardware.

Findings

01 Achieves up to 5.2% accuracy improvement on downstream tasks.

02 Provides up to 10.4x speedup over full-precision models.

03 Offloads residual compensation to the CPU for resource efficiency.

Abstract

LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial parameter storage but access only a small subset of compensation parameters during each inference step. Motivated by this opportunity, we propose HCInfer, a heterogeneous inference system that offloads residual compensation to the CPU while executing the compressed backbone on the GPU, and further introduces an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to hide…
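The split described in the abstract can be illustrated with a toy example. The sketch below is not HCInfer's implementation: it assumes a uniform round-to-nearest quantizer as the "compression", a rank-1 correction obtained by power iteration as the "LoRA-style branch", and a worker thread as a stand-in for the CPU side. All function names (`quantize`, `rank1_factors`, `forward`, `approx_error`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def matvec(M, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def quantize(W, step):
    """Uniform round-to-nearest quantization (assumed stand-in for real compression)."""
    return [[round(w / step) * step for w in row] for row in W]

def rank1_factors(R, iters=50):
    """Power iteration on R^T R. Returns (u, v) so that R is approximated by outer(u, v)."""
    rows, cols = len(R), len(R[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = matvec(R, v)
        rt_u = [sum(R[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(t * t for t in rt_u) ** 0.5
        if norm == 0.0:
            break
        v = [t / norm for t in rt_u]
    return matvec(R, v), v  # u carries the singular-value scale

def approx_error(W, Wq, u=None, v=None):
    """Frobenius norm of W - Wq, optionally minus the rank-1 correction outer(u, v)."""
    total = 0.0
    for i, (rw, rq) in enumerate(zip(W, Wq)):
        for j, (w, q) in enumerate(zip(rw, rq)):
            corr = u[i] * v[j] if u is not None else 0.0
            total += (w - q - corr) ** 2
    return total ** 0.5

def forward(Wq, u, v, x, pool):
    """Backbone and compensation branch run concurrently, then are summed."""
    # Worker thread plays the role of the CPU-side compensation branch.
    fut = pool.submit(lambda: [ui * sum(vj * xj for vj, xj in zip(v, x)) for ui in u])
    y_backbone = matvec(Wq, x)  # main thread plays the role of the GPU backbone
    comp = fut.result()         # synchronize before combining the two outputs
    return [a + b for a, b in zip(y_backbone, comp)]

# Toy weights and input, chosen arbitrarily for the demonstration.
W = [[0.9, -0.4, 0.3], [0.2, 0.7, -0.6]]
Wq = quantize(W, step=0.5)
residual = [[w - q for w, q in zip(rw, rq)] for rw, rq in zip(W, Wq)]
u, v = rank1_factors(residual)
x = [1.0, 2.0, -1.0]

with ThreadPoolExecutor(max_workers=1) as pool:
    y = forward(Wq, u, v, x, pool)
```

In this toy setting the rank-1 branch stores only `len(u) + len(v)` extra numbers per layer, and the correction is computed independently of the quantized matrix-vector product, which is what makes it a natural candidate for running on a different device.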
