RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
Chen Xu, Yuxuan Yue, Zukang Xu, Xing Hu, Jiangyong Yu, Zhixuan Chen, Sifan Zhou, Zhihang Yuan, Dawei Yang

TL;DR
This paper introduces RWKVQuant, a post-training quantization framework built specifically for RWKV models. By addressing quantization challenges particular to RWKV, it reduces model size and inference latency with minimal accuracy loss.
Contribution
It proposes a proxy-guided hybrid of scalar and vector quantization, together with codebook optimization tailored for RWKV, to improve quantization accuracy and efficiency.
Findings
Quantizes RWKV-6-14B to 3-bit with less than 1% accuracy loss
Achieves 2.14x speedup in inference
Addresses non-linear operator and weight distribution challenges in RWKV
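As a rough sense of scale for the 3-bit finding, the weight storage can be estimated from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a figure from the paper: it assumes a nominal 14 billion parameters (taken from the model name), a 16-bit baseline, and it ignores codebooks, scales, and any layers left unquantized. The helper `weight_size_gb` is a hypothetical name used only for illustration.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: "14B" is read as 14e9 parameters; the 16-bit baseline is also assumed.
params = 14e9
fp16_gb = weight_size_gb(params, 16)  # 16-bit baseline
q3_gb = weight_size_gb(params, 3)     # 3 bits per weight, no overhead counted

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"3-bit weights:  {q3_gb:.2f} GB")
print(f"reduction:      {fp16_gb / q3_gb:.2f}x")
```

Under these assumptions the weights shrink from about 28 GB to about 5.25 GB. Real checkpoints will be somewhat larger once quantization metadata is included.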
Abstract
RWKV is a modern RNN architecture with comparable performance to Transformer, but still faces challenges when deployed to resource-constrained devices. Post-Training Quantization (PTQ), which is an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant degradation of performance when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy…
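One intuition behind constraint (2) can be shown with a toy experiment. The sketch below is not the paper's method or proxy: it simply compares a plain 3-bit uniform grid (a basic scalar quantizer) with a 3-bit 1-D k-means codebook (a basic cluster-based quantizer) on synthetic data. The function names, sample size, and distributions are all assumptions chosen for illustration. On bell-shaped data the learned codebook cuts error substantially, while on uniformly distributed data it has almost nothing to exploit.

```python
import numpy as np


def uniform_scalar_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Split [min, max] into 2**bits equal cells; reconstruct at cell centers."""
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    step = (hi - lo) / levels
    idx = np.clip(np.floor((w - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step


def kmeans_quantize(w: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """1-D Lloyd (k-means) codebook with 2**bits centroids."""
    levels = 2 ** bits
    # Start from evenly spaced quantiles so every centroid owns some points.
    centroids = np.quantile(w, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            members = w[idx == k]
            if members.size:
                centroids[k] = members.mean()
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[idx]


def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
n, bits = 20_000, 3
uniform_w = rng.uniform(-1.0, 1.0, size=n)  # synthetic "uniformly distributed" weights
gauss_w = rng.normal(0.0, 1.0, size=n)      # synthetic bell-shaped weights

# Gain = (uniform-grid error) / (k-means error); about 1 means clustering does not help.
gain_uniform = mse(uniform_w, uniform_scalar_quantize(uniform_w, bits)) / mse(
    uniform_w, kmeans_quantize(uniform_w, bits)
)
gain_gauss = mse(gauss_w, uniform_scalar_quantize(gauss_w, bits)) / mse(
    gauss_w, kmeans_quantize(gauss_w, bits)
)

print(f"k-means gain on uniform weights:  {gain_uniform:.2f}x")
print(f"k-means gain on Gaussian weights: {gain_gauss:.2f}x")
```

The gain on the uniform sample stays close to 1, whereas the Gaussian sample benefits clearly from clustering. This is consistent with the idea of choosing between scalar and cluster-based quantization per weight group, though how RWKVQuant actually makes that choice is described in the paper itself, not here.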
Taxonomy
Topics: Advanced Neural Network Applications · Network Packet Processing and Optimization · Advanced Data Compression Techniques
