LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference
Dong Liu, Yanxuan Yu

TL;DR
LLMEasyQuant is a flexible, system-aware quantization framework that enables efficient, scalable low-bit inference of large language models across various hardware setups, improving speed and resource utilization.
Contribution
The paper introduces a modular, system-aware quantization toolkit that supports multiple methods through unified interfaces and is optimized for multi-GPU and distributed environments.
Findings
Achieves substantial speedups in GEMM execution and shorter HBM load times.
Supports near-linear multi-GPU scaling.
Balances latency, memory, and accuracy effectively.
Abstract
As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack transparency, flexibility, and system-level scalability across GPUs and distributed environments. We present LLMEasyQuant, a modular, system-aware quantization framework designed for efficient, low-bit inference of LLMs on single-node multi-GPU, multi-node, and edge hardware. LLMEasyQuant supports a wide range of quantization methods, including Symmetric Quantization, ZeroQuant, SmoothQuant, and SimQuant, with unified interfaces for per-layer calibration, bitwidth assignment, and runtime adaptation. It integrates fused CUDA kernels with NCCL-based distributed synchronization and supports both static and online quantization. Empirical results show…
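To make the simplest method named in the abstract concrete, here is a minimal sketch of per-tensor symmetric quantization in plain Python. It is an illustration of the general technique, not LLMEasyQuant's API: the function names `quantize_symmetric` and `dequantize`, the 8-bit default, and the sample weights are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Per-tensor symmetric quantization (hypothetical helper, not the paper's API).

    Maps floats to signed integers in [-qmax, qmax] using a single scale,
    with the zero-point fixed at 0.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Illustrative weights, chosen only for this example.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

Because values are rounded to the nearest integer code, each restored value differs from its original by at most half the scale. Production toolkits typically apply the same idea per channel or per group and run it in fused GPU kernels rather than Python lists.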
Taxonomy
Topics: Mathematics, Computing, and Information Processing
