MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Elias Frantar; Roberto L. Castro; Jiale Chen; Torsten Hoefler; Dan Alistarh

arXiv:2408.11743·cs.LG·August 22, 2024

Open Access · 2 Repos

TL;DR

MARLIN is a mixed-precision GPU kernel design for batched inference on large language models with quantized weights. It combines techniques such as asynchronous memory access and task pipelining to reach a near-4x speedup at batch sizes up to 16-32 and up to 2.8x end-to-end inference speedup.

Contribution

This paper presents MARLIN, a GPU kernel design for high-speed batched LLM inference with quantized weights. It extends the speedups of weight quantization from single-user inference to the batched, multi-client settings that matter for practical serving.

Findings

01. Supports batch sizes up to 16-32 with a near-4x speedup

02. Achieves up to 2.8x end-to-end inference speedup

03. Extensible to additional compression techniques such as sparsity

Abstract

As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains open whether speedups are achievable also in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are…
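
The abstract's question of whether a quantized kernel can stay memory-bound as the batch grows can be illustrated with a rough roofline-style estimate. The sketch below is not taken from the paper: the function names, the 4096 x 4096 layer shape, and the hardware figures (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) are hypothetical values chosen for illustration, and the traffic model assumes FP16 activations while ignoring caches, quantization scales, and other overheads.

```python
def arithmetic_intensity(batch, k, n, weight_bits=4, act_bytes=2):
    """FLOPs per byte of memory traffic for a (batch x k) @ (k x n) matmul.

    Simplified model (an assumption, not the paper's analysis): weights are
    read once at `weight_bits` bits each; input and output activations are
    moved once at `act_bytes` bytes per element.
    """
    flops = 2 * batch * k * n                   # one multiply and one add per term
    weight_traffic = k * n * weight_bits / 8    # bytes of weights read
    act_traffic = act_bytes * batch * (k + n)   # bytes of inputs read + outputs written
    return flops / (weight_traffic + act_traffic)


def is_memory_bound(batch, k, n, peak_flops, bandwidth, weight_bits=4):
    """True if the matmul's intensity is below the machine balance (FLOPs per byte)."""
    machine_balance = peak_flops / bandwidth
    return arithmetic_intensity(batch, k, n, weight_bits) < machine_balance


if __name__ == "__main__":
    # Hypothetical hardware figures, chosen only to make the example concrete.
    PEAK_FLOPS = 100e12   # 100 TFLOP/s
    BANDWIDTH = 1e12      # 1 TB/s
    K = N = 4096          # hypothetical layer shape

    for batch in (1, 4, 16, 64):
        ai = arithmetic_intensity(batch, K, N)
        bound = "memory" if is_memory_bound(batch, K, N, PEAK_FLOPS, BANDWIDTH) else "compute"
        print(f"batch={batch:3d}  intensity={ai:7.1f} FLOP/byte  -> {bound}-bound")
```

Under these assumed numbers, a single-token matmul moves about a quarter of the bytes with 4-bit weights that it would with 16-bit weights, and the layer stays below the machine balance at batch 16 but not at batch 64. Real thresholds depend on the actual GPU and on how well the kernel overlaps memory access with compute.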

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: MARLIN