RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

Quan Wei; Chung-Yiu Yau; Hoi-To Wai; Yang Katie Zhao; Dongyeop Kang; Youngsuk Park; Mingyi Hong

arXiv:2502.09003·cs.LG·June 9, 2025

TL;DR

RoSTE is a quantization-aware supervised fine-tuning method for large language models. It uses an adaptive rotation strategy to reduce activation outliers, which in turn lowers quantization error.

Contribution

The paper introduces RoSTE, which combines quantization-aware supervised fine-tuning with an adaptive rotation strategy, and supports it with theoretical analysis and empirical results that improve on existing methods.

Findings

01 RoSTE reduces quantization error through adaptive rotation.

02 It achieves better performance than post-training quantization baselines.

03 It is effective across various LLM architectures and tasks.
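Finding 01 rests on a general property of rotations that a small sketch can illustrate. The code below is not the RoSTE algorithm. As an assumption for illustration, it applies one fixed normalized Hadamard rotation (RoSTE selects its rotation configuration adaptively) to a made-up activation vector with a single outlier channel, and it uses a hypothetical 4-bit symmetric round-to-nearest quantizer. All names and values are illustrative.

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in H]

def matvec(M, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def quantize(x, bits=4):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Made-up activation vector: seven small channels and one outlier channel.
x = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 8.0]
H = hadamard(len(x))

# Direct quantization: the outlier sets the scale, so small channels round to zero.
err_plain = mse(x, quantize(x))

# Rotate, quantize, rotate back. This H is symmetric and orthogonal, so H is its own inverse.
x_back = matvec(H, quantize(matvec(H, x)))
err_rot = mse(x, x_back)

print(f"plain: {err_plain:.5f}  rotated: {err_rot:.5f}")
```

For this particular vector the rotated error is smaller than the direct error, because the rotation spreads the outlier's magnitude across all channels before rounding. The size of the gain depends on the input, which is one reason a fixed rotation is only a stand-in here.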

Abstract

Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction…
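The straight-through estimator that gives RoSTE its name can be sketched on a toy problem. The following is a generic illustration of quantization-aware training with a straight-through gradient, not the paper's method. It omits rotation, quantizes weights only, and the grid step, data, learning rate, and function names are all hypothetical.

```python
def quantize(w, step=0.25):
    """Round each weight to the nearest multiple of `step` (hypothetical grid)."""
    return [round(v / step) * step for v in w]

def loss_and_grad(w, data):
    """Mean squared loss evaluated at the quantized weights.

    The gradient is taken with respect to the quantized weights and then
    applied to the latent full-precision weights, as if rounding were the
    identity map. That substitution is the straight-through estimator.
    """
    wq = quantize(w)
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        err = sum(a * b for a, b in zip(wq, x)) - y
        loss += 0.5 * err * err
        for i, xi in enumerate(x):
            grad[i] += err * xi
    n = len(data)
    return loss / n, [g / n for g in grad]

# Toy regression data generated by the weights [0.5, -0.75], which lie on the grid.
data = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.75), ([1.0, 1.0], -0.25)]

w = [0.0, 0.0]  # latent full-precision weights
initial_loss, _ = loss_and_grad(w, data)
for _ in range(200):
    _, g = loss_and_grad(w, data)
    w = [wi - 0.1 * gi for wi, gi in zip(w, g)]
final_loss, _ = loss_and_grad(w, data)

print("quantized weights:", quantize(w))
print("loss:", initial_loss, "->", final_loss)
```

The forward pass always sees quantized weights, so the loss being minimized is the loss of the model that would actually be deployed. The latent weights accumulate small updates until they cross a rounding boundary, which is how training moves between grid points even though rounding itself has zero gradient almost everywhere.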

Videos

RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: LLaMA · Pythia