LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference

Dong Liu; Yanxuan Yu

arXiv:2406.19657 · cs.LG · December 1, 2025 · 2 citations

TL;DR

LLMEasyQuant is a flexible, system-aware quantization framework that enables efficient, scalable low-bit inference of large language models across various hardware setups, improving speed and resource utilization.

Contribution

It introduces a modular, system-aware quantization toolkit supporting multiple methods with unified interfaces, optimized for multi-GPU and distributed environments.

Findings

1. Achieves substantial speedup in GEMM execution and HBM load time.
2. Supports near-linear multi-GPU scaling.
3. Balances latency, memory, and accuracy effectively.

Abstract

As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack transparency, flexibility, and system-level scalability across GPUs and distributed environments. We present LLMEasyQuant, a modular, system-aware quantization framework designed for efficient, low-bit inference of LLMs on single-node multi-GPU, multi-node, and edge hardware. LLMEasyQuant supports a wide range of quantization methods (including Symmetric Quantization, ZeroQuant, SmoothQuant, and SimQuant) with unified interfaces for per-layer calibration, bitwidth assignment, and runtime adaptation. It integrates fused CUDA kernels with NCCL-based distributed synchronization and supports both static and online quantization. Empirical results show…
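
As an illustration of the simplest method the abstract names, the following is a minimal sketch of per-tensor symmetric quantization in plain Python. It is not LLMEasyQuant's API: the function names `symmetric_quantize` and `dequantize`, the 8-bit default, and the max-absolute-value scale rule are assumptions chosen for this example, and the framework's actual calibration and kernels may differ.

```python
def symmetric_quantize(values, num_bits=8):
    """Quantize floats to signed integers with a single zero-centred scale.

    Hypothetical helper for illustration; not part of LLMEasyQuant.
    """
    qmax = 2 ** (num_bits - 1) - 1           # largest positive level, e.g. 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0   # avoid division by zero for all-zero input
    # Round to the nearest level and clamp to the symmetric range [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [x * scale for x in q]


# Toy weights (made-up values, not from the paper).
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = symmetric_quantize(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Because the range is symmetric around zero, no zero-point offset is stored, and for values inside the calibrated range the round-trip error is bounded by half the scale. Finer-grained or activation-aware schemes change how the scale is chosen rather than this basic round-and-clamp step.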

Code & Models

Repositories

NoakLiu/LLMEasyQuant (PyTorch, official)

Taxonomy

Topics: Mathematics, Computing, and Information Processing