Low-Rank Quantization-Aware Training for LLMs

Yelysei Bondarenko; Riccardo Del Chiaro; Markus Nagel

arXiv:2406.06385·cs.LG·September 4, 2024·2 cites

TL;DR

The paper introduces LR-QAT, a low-rank, memory-efficient quantization-aware training (QAT) method for large language models that matches the performance of full-model QAT with significantly less memory and training time.

Contribution

LR-QAT is a novel, lightweight QAT algorithm that employs low-rank auxiliary weights and gradient checkpointing, enabling efficient training of LLMs without loss of predictive performance.
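As a rough illustration of what "low-rank auxiliary weights" inside a quantizer can look like, the sketch below adds a low-rank product to the scaled pretrained weights before rounding and clipping, so the result is a single integer matrix. The function names (`lr_quantize`, `dequantize`), the `alpha / rank` scaling, and the toy values are assumptions made for this example, not code from the paper or its repository, and only the forward computation is shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lr_quantize(w0, a, b, scale, bits=4, alpha=1.0):
    """Integer weights: clip(round(w0 / scale + (alpha / rank) * A @ B)).

    ASSUMPTION: the low-rank term lives in the integer domain, i.e. it is
    added after dividing the frozen weights by the scale. This is a
    hypothetical formulation for illustration only.
    """
    rank = len(b)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    ab = matmul(a, b)
    return [
        [
            max(qmin, min(qmax, round(w0[i][j] / scale + alpha / rank * ab[i][j])))
            for j in range(len(w0[0]))
        ]
        for i in range(len(w0))
    ]


def dequantize(q, scale):
    """Map integer weights back to real values on the quantization grid."""
    return [[scale * v for v in row] for row in q]


# Toy 2x2 example with rank-1 auxiliary weights (made-up values).
w0 = [[0.5, -1.0], [2.0, 0.26]]
a = [[1.0], [0.0]]
b = [[0.6, -0.6]]
q = lr_quantize(w0, a, b, scale=0.25)
w_hat = dequantize(q, 0.25)
```

Because the low-rank term is absorbed before rounding, inference in this sketch needs only `q` and `scale`; `a` and `b` can be discarded after training. A real training loop would also need a gradient approximation for the rounding step, which is omitted here.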

Findings

01. Outperforms common PTQ methods on LLMs.

02. Achieves full-model QAT performance with less memory.

03. Enables training of 7B LLMs on a single GPU.
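To put the single-GPU finding in context, the following back-of-the-envelope sketch computes weight storage alone for a model of this size at different precisions. It assumes exactly 7 billion parameters and ignores activations, gradients, optimizer state, and quantization scales, so the numbers are lower bounds for illustration rather than measurements from the paper; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Storage for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30


# ASSUMPTION: "7B" is treated as exactly 7e9 parameters.
params = 7_000_000_000

fp32_gib = weight_memory_gib(params, 32)
fp16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)

for name, gib in [("FP32", fp32_gib), ("FP16", fp16_gib), ("INT4", int4_gib)]:
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the FP32 weights by themselves already exceed 24 GB, which is why keeping the frozen weights in a low-bit format matters before any training overhead is counted.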

Abstract

Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute- and memory-efficient. Quantization-aware training (QAT) methods generally produce the best quantized performance, but this comes at the cost of potentially long training time and excessive memory usage, making them impractical to apply to LLMs. Inspired by the parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT, a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and…
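The abstract mentions "double-packed integers" without giving the layout. One common way to realize the idea is to store two signed 4-bit values in a single byte, as sketched below. The nibble order (first value in the low nibble) and the helper names are assumptions for this example and may differ from what the paper's implementation does.

```python
def pack_int4_pair(lo, hi):
    """Pack two signed 4-bit integers in [-8, 7] into one byte (0..255).

    ASSUMPTION: `lo` goes in the low nibble, `hi` in the high nibble.
    """
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Masking with 0xF keeps the two's-complement bit pattern of each value.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte):
    """Recover the two signed 4-bit integers from a packed byte."""

    def sign_extend(nibble):
        # Nibbles 8..15 represent the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


def pack_row(values):
    """Pack an even-length list of INT4 values into a bytes object."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    return bytes(
        pack_int4_pair(values[i], values[i + 1]) for i in range(0, len(values), 2)
    )


packed = pack_row([3, -5, 7, 1])
restored = [v for byte in packed for v in unpack_int4_pair(byte)]
```

Packing halves the storage of the frozen integer weights relative to one value per byte, at the cost of an unpack step whenever the weights are used.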

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is well-written and explained clearly. The approach is well-supported by a thorough experimental section, showcasing promising results that validate its efficiency and robustness across multiple large language models.

Weaknesses

The novelty of the approach is somewhat limited, as it incorporates several known techniques rather than introducing entirely new concepts. While the paper provides detailed memory and runtime comparisons with full-model QAT (e.g., LSQ), it does not compare these metrics against other implemented baselines such as LSQ or OmniQuant, which limits the assessment of its relative efficiency.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper is well-written with adequate background and related work.
2. The paper combines LoRA and QAT techniques and applies downcasting and gradient checkpointing for memory-efficient and inference-efficient LLMs.
3. The key innovation lies in the seamless fusion of the quantized low-rank adapters with the quantization field of the frozen pretrained weights, unlike other LoRA-inspired quantization works.
4. The empirical analysis is extensive with sufficient comparisons and ablation studies with imp

Weaknesses

1. Figure 1 is probably not referenced in the main text. A deeper discussion of Fig. 1 (right) is probably needed with respect to the goal of training LLMs on a single device with 24GB.
2. It might be helpful to have a comparison table or figure that depicts what parts of the model are being quantized and the quantization scheme used across the various PTQ, QAT, and LoRA-inspired works referenced and compared in the experiments; this would help put related works in perspective.
3. LoRA is predominantly used

Reviewer 03 · Rating 5 · Confidence 4

Strengths

LR-QAT introduces and combines several innovations designed to reduce memory use without sacrificing model performance: (1) a form of QAT with low-rank reparameterization, in which it places the low-rank weights in the integer domain to ensure they align with the quantization grid of the pretrained weights. This allows for seamless fusion during inference into a single low-bit integer matrix. (2) A downcasting operator that represents the frozen pretrained weights as low-bit INT-b (b ≤ 4) double

Weaknesses

The novelty may be limited. The proposed method combines traditional quantization and LoRA methods. The downcasting operator casts the input to one of the pre-existing floating-point formats, which follows previous works such as (Oberstar, 2007) and (Li et al., 2023). The gradient checkpointing mainly follows the previous work (Chen et al., 2016). The technical contribution may be limited. It only reports results from fine-tuning on a single dataset. It is better to demonstrate the performance by finetu

Code & Models

Repositories

qualcomm-ai-research/lr-qat · PyTorch · Official
