RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization

Chen Xu; Yuxuan Yue; Zukang Xu; Xing Hu; Jiangyong Yu; Zhixuan Chen; Sifan Zhou; Zhihang Yuan; Dawei Yang

arXiv:2505.03803·cs.LG·May 8, 2025

TL;DR

This paper introduces RWKVQuant, a post-training quantization framework specialized for RWKV models. It significantly reduces model size and inference latency with minimal accuracy loss by addressing quantization challenges specific to the RWKV architecture.

Contribution

It proposes a novel proxy-guided hybrid of scalar and vector quantization, together with a codebook optimization tailored for RWKV, to improve quantization accuracy and efficiency.

Findings

1. Quantizes RWKV-6-14B to 3-bit with less than 1% accuracy loss

2. Achieves 2.14x speedup in inference

3. Addresses non-linear operator and weight distribution challenges in RWKV

Abstract

RWKV is a modern RNN architecture with performance comparable to Transformers, but it still faces challenges when deployed to resource-constrained devices. Post-Training Quantization (PTQ), an essential technique for reducing model size and inference latency, has been widely used in Transformer models. However, it suffers significant performance degradation when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy…
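
As a rough illustration of the two quantization families the abstract contrasts, the sketch below implements a generic uniform scalar quantizer (round-to-nearest over the min-max range) and a generic cluster-based vector quantizer (a k-means codebook over pairs of weights). It is not the paper's method: the proxy, the codebook optimization, and all function names here (`scalar_quantize`, `vector_quantize`, and so on) are hypothetical and chosen only for illustration, and the codebook's own storage cost is ignored.

```python
import random


def scalar_quantize(weights, bits):
    """Uniform round-to-nearest quantization over the [min, max] range."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]


def nearest(v, centroids):
    """Index of the centroid closest to v in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
    )


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids used as the codebook."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if its cluster is empty
                centroids[i] = tuple(sum(c) / len(bucket) for c in zip(*bucket))
    return centroids


def vector_quantize(weights, bits_per_weight, dim=2):
    """Replace each group of `dim` weights by its nearest codebook entry.

    Assumes len(weights) is a multiple of `dim`. The codebook has
    2 ** (bits_per_weight * dim) entries, so each index costs
    bits_per_weight bits per weight.
    """
    vectors = [tuple(weights[i:i + dim]) for i in range(0, len(weights), dim)]
    codebook = kmeans_codebook(vectors, 2 ** (bits_per_weight * dim))
    out = []
    for v in vectors:
        out.extend(codebook[nearest(v, codebook)])
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = {
        "uniform": [rng.uniform(-1.0, 1.0) for _ in range(512)],
        "gaussian": [rng.gauss(0.0, 1.0) for _ in range(512)],
    }
    for name, w in samples.items():
        sq_err = mse(w, scalar_quantize(w, 2))
        vq_err = mse(w, vector_quantize(w, 2))
        print(f"{name}: scalar MSE={sq_err:.5f}  vector MSE={vq_err:.5f}")
```

Running the script prints the reconstruction error of both quantizers on synthetic uniform and Gaussian weights at 2 bits per weight, which gives a feel for how the weight distribution affects each family. Any numbers it produces come from toy data and say nothing about the results reported in the paper.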

Taxonomy

Topics: Advanced Neural Network Applications · Network Packet Processing and Optimization · Advanced Data Compression Techniques