Efficient Edge LLMs Deployment via Hessian-Aware Quantization and CPU-GPU Collaborative

Tuo Zhang; Ning Li; Xin Yuan; Wenchao Xu; Quan Chen; Song Guo; Haijun Zhang

arXiv:2508.07329 · cs.LG · August 12, 2025

TL;DR

This paper introduces a Hessian-Aware Quantization method and CPU-GPU collaborative inference strategy to efficiently deploy large language models on edge devices, reducing memory and latency while maintaining accuracy.

Contribution

It proposes a novel Hessian-Aware Quantization technique and an expert-level collaborative inference mechanism for edge deployment of MoE models, addressing outlier-induced accuracy loss and resource constraints.

Findings

01. Achieves near-full-precision accuracy with 8-bit quantization on large models.

02. Reduces GPU memory usage by approximately 60%.

03. Significantly reduces inference latency on edge hardware.

Abstract

With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) The presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed…
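
The outlier problem named in difficulty (1) can be illustrated with a minimal sketch. This is not the paper's HAQ method: it uses plain symmetric per-tensor 8-bit quantization, and the function names, the Gaussian test data, and the outlier value of 100.0 are all assumptions chosen for illustration. It shows how a single large activation stretches the quantization scale and raises the reconstruction error for every other value.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (illustrative, not HAQ)."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor, set by the largest magnitude.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to floating-point values."""
    return [c * scale for c in codes]


def quantization_mse(values):
    """Mean squared error between values and their int8 round trip."""
    codes, scale = quantize_int8(values)
    restored = dequantize(codes, scale)
    return sum((x - y) ** 2 for x, y in zip(values, restored)) / len(values)


# Hypothetical activations: standard-normal values, fixed seed.
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Same tensor with one value replaced by an assumed outlier.
with_outlier = activations[:-1] + [100.0]

mse_clean = quantization_mse(activations)
mse_outlier = quantization_mse(with_outlier)

print("MSE without outlier:", mse_clean)
print("MSE with one outlier:", mse_outlier)
```

Because the scale is tied to the largest magnitude, one outlier coarsens the grid for all 4096 values. Methods that smooth or otherwise redistribute outliers before quantizing aim to avoid exactly this effect.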

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Multimodal Machine Learning Applications