MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh

TL;DR
MARLIN is a mixed-precision GPU kernel design for batched inference on large language models with quantized weights. It combines techniques such as asynchronous memory access and task pipelining to keep close to the full quantization speedup at batch sizes up to 16-32.
Contribution
This paper presents MARLIN, a new kernel design that supports high-speed batched LLM inference with quantized weights, extending GPU acceleration capabilities for practical serving scenarios.
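As a rough illustration of what "quantized weights" with mixed precision means in practice, the sketch below packs 4-bit weights eight to a 32-bit word, then dequantizes them on the fly while multiplying against a batch of floating-point activations. It is a plain-Python toy under stated assumptions, not MARLIN's actual kernel or memory layout: the function names, the per-row scale, and the fixed zero point of 8 are all hypothetical choices made for this example.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15), eight per 32-bit word."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15
            word |= v << (4 * j)  # element j occupies bits 4j..4j+3
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the 4-bit values in order."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


def quantize_row(row):
    """Quantize one weight row so that w is approximately scale * (q - 8).

    Assumption for this sketch: one scale per row and a fixed zero point of 8.
    """
    max_abs = max(abs(w) for w in row) or 1.0
    scale = max_abs / 7.0
    q = [min(15, max(0, round(w / scale) + 8)) for w in row]
    return pack_int4(q), scale


def dequant_matmul(packed_rows, scales, batch):
    """Multiply a batch of activation vectors by 4-bit weights.

    Each weight row is unpacked and dequantized once, then reused for
    every input in the batch.
    """
    out = [[0.0] * len(packed_rows) for _ in batch]
    for r, (words, scale) in enumerate(zip(packed_rows, scales)):
        w = [scale * (q - 8) for q in unpack_int4(words)]
        for b, x in enumerate(batch):
            out[b][r] = sum(wi * xi for wi, xi in zip(w, x))
    return out
```

The point of the inner loop is that the cost of reading and unpacking a weight row is shared by all inputs in the batch, while the arithmetic grows with the batch size. That trade-off is what a batched mixed-precision kernel has to manage.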
Findings
Supports batch sizes up to 16-32 with close to the maximum (4x) speedup available from quantization
Achieves end-to-end LLM inference speedups of up to 2.8x when integrated with the vLLM serving engine
Extensible to additional compression techniques like sparsity
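The batch-size finding can be motivated with a simple roofline-style estimate. The sketch below compares an idealized 16-bit baseline with 4-bit weights, assuming each weight costs 2 x batch FLOPs and that only weight traffic matters (activations, outputs, and quantization scales are ignored). The hardware figures (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) are round, hypothetical numbers chosen for illustration, not measurements from the paper.

```python
def ideal_speedup(batch, weight_bits=4, baseline_bits=16,
                  peak_flops=100e12, mem_bandwidth=1e12):
    """Idealized speedup of quantized over baseline weights for one linear layer.

    Per weight element, time is the larger of the time to load it from
    memory and the time to perform its multiply-accumulates.
    All hardware parameters are illustrative assumptions.
    """
    t_compute = (2.0 * batch) / peak_flops            # one multiply-add per input
    t_mem_quant = (weight_bits / 8.0) / mem_bandwidth
    t_mem_base = (baseline_bits / 8.0) / mem_bandwidth
    return max(t_mem_base, t_compute) / max(t_mem_quant, t_compute)


if __name__ == "__main__":
    for b in (1, 16, 32, 64, 128):
        print(f"batch {b:>3}: ideal speedup {ideal_speedup(b):.2f}x")
```

Under these assumptions the 4-bit case stays memory-bound, and so keeps the full 4x ratio, until compute time per weight overtakes load time; beyond that point the ratio shrinks as the batch grows. Real kernels fall short of this bound, which is why the design of the kernel itself matters.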
Abstract
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains open whether speedups are achievable also in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: MARLIN
