TP-Aware Dequantization

Adnan Hoque; Mudhakar Srivatsa; Chih-Chieh Yang; Raghu Ganti

arXiv:2402.04925 · cs.DC · February 8, 2024 · 1 citation

Open Access

TL;DR

This paper introduces a TP-aware dequantization method that reduces latency in distributed large language model inference by preserving data locality in GPU memory access and reducing global communication, achieving a speedup of up to 1.81x over existing methods.

Contribution

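The paper presents an optimized inference deployment scheme that addresses the limitations of state-of-the-art quantization kernels when they are combined with Tensor Parallel (TP), reducing inference latency for distributed deployments of large models.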

Findings

01. Up to 1.81x speedup on Llama-70B

02. Up to 1.78x speedup on IBM WatsonX Granite-20B

03. Effective across various tensor parallel settings

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate a speedup of up to 1.81x over existing methods for Llama-70B and up to 1.78x for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX Systems for a variety of TP settings.
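
One way to picture how a priori knowledge of TP can reduce global communication is the NumPy sketch below. It is an illustration under stated assumptions, not the paper's implementation: it assumes a two-layer MLP whose second weight matrix is stored with its rows in a reordered (permuted) layout, a column-parallel then row-parallel sharding, and ReLU as a stand-in elementwise activation. The function names, `perm`, and `tp` are hypothetical. The sketch only checks the underlying identity: permuting the first layer's columns offline gives the same result as permuting the intermediate activation at runtime, so each simulated rank can work on its own shard and only the final sum is shared.

```python
import numpy as np

def mlp_reference(x, w1, w2, perm):
    """Baseline: reorder the intermediate activation at runtime.

    Under TP, indexing with `perm` needs the full activation, which is
    where a gather across ranks would be required (assumption of this sketch).
    """
    y1 = np.maximum(x @ w1, 0.0)       # elementwise activation (ReLU stand-in)
    return y1[:, perm] @ w2[perm, :]   # second layer stored in permuted row order

def mlp_tp_aware(x, w1, w2, perm, tp):
    """Sketch: fold the permutation into w1's columns offline, then shard."""
    w1_p = w1[:, perm]                 # offline column permutation of layer 1
    w2_p = w2[perm, :]                 # layer 2 rows in the same permuted order
    w1_shards = np.split(w1_p, tp, axis=1)   # column-parallel shards
    w2_shards = np.split(w2_p, tp, axis=0)   # row-parallel shards
    # Each simulated rank uses only its own shards; no activation reordering.
    partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)]
    return sum(partials)               # stands in for the final all-reduce

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))
perm = rng.permutation(16)

out_ref = mlp_reference(x, w1, w2, perm)
out_tp = mlp_tp_aware(x, w1, w2, perm, tp=4)
print(np.allclose(out_ref, out_tp))
```

The equivalence relies on the activation being elementwise, so that applying it commutes with a column permutation. Quantization, dequantization kernels, and real inter-GPU communication are deliberately left out of this sketch.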

Taxonomy

Topics: Parallel Computing and Optimization Techniques