RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
Arpit Singh Gautam, Saurabh Jha

TL;DR
RAMP is a reinforcement learning framework that adaptively assigns mixed-precision bit widths to LLM layers, balancing accuracy against efficiency for on-device inference across various models and hardware.
Contribution
It introduces a novel off-policy Soft Actor-Critic method for layer-wise bit-width selection with a transferable policy, along with a new Scale Folding technique for stable ultra-low-precision quantization.
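To make layer-wise bit-width selection concrete, the following minimal sketch shows how a per-layer assignment translates into a parameter-weighted average bit width and an approximate weight footprint. The layer sizes, the example assignments, and the helper names (`average_bits`, `weight_size_gb`, `within_budget`) are hypothetical and are not taken from the paper; the size estimate also ignores quantization metadata such as scales and zero points.

```python
def average_bits(param_counts, bit_widths):
    """Parameter-weighted mean bit width across layers."""
    total_params = sum(param_counts)
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / total_params


def weight_size_gb(param_counts, bit_widths):
    """Approximate weight storage in GB (10^9 bytes), metadata excluded."""
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8 / 1e9


def within_budget(param_counts, bit_widths, budget_bits):
    """True if the assignment's average bit width does not exceed the budget."""
    return average_bits(param_counts, bit_widths) <= budget_bits


# Hypothetical three-layer model (parameter counts per layer).
layers = [200_000_000, 200_000_000, 100_000_000]

uniform = [4, 4, 4]   # same precision everywhere
mixed = [3, 4, 8]     # fewer bits on one layer, more on another

uniform_avg = average_bits(layers, uniform)
mixed_avg = average_bits(layers, mixed)
mixed_gb = weight_size_gb(layers, mixed)
```

In this toy setup the mixed assignment fits an average budget of 4.5 bits but not one of 4.0 bits, which is the kind of constraint a learned policy would have to respect while choosing which layers get more precision.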
Findings
Achieves 5.54 perplexity at 3.68 GB on Llama 2 7B, outperforming uniform quantization methods.
Transfers the quantization policy zero-shot to larger and different models.
Retains 99.5% of FP16 reasoning performance in an efficient inference pipeline.
Abstract
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget…
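The idea of migrating activation outliers into weights through per-channel scaling with normalization-layer compensation can be illustrated with a small sketch. This is not the paper's implementation: the RMSNorm layer, the toy tensors, and in particular the scale rule in `channel_scales` (a SmoothQuant-style formula with an assumed `alpha` of 0.5) are assumptions chosen for illustration. The sketch only demonstrates that dividing the norm gain by a per-channel scale and multiplying the matching weight columns by the same scale leaves the layer output unchanged while shrinking the activation peak.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over one token vector: gain * x / rms(x)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def linear(weight, x):
    """y = W x, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def channel_scales(acts, weight, alpha=0.5):
    """Assumed rule (SmoothQuant-style): s_j = max|a_j|^alpha / max|W_:j|^(1-alpha)."""
    scales = []
    for j in range(len(acts[0])):
        a_max = max(abs(a[j]) for a in acts)
        w_max = max(abs(row[j]) for row in weight)
        scales.append(max(a_max ** alpha / w_max ** (1 - alpha), 1e-5))
    return scales


def fold_scales(gain, weight, scales):
    """Divide the norm gain by s and multiply the matching weight columns by s."""
    new_gain = [g / s for g, s in zip(gain, scales)]
    new_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    return new_gain, new_weight


# Toy data: channel 1 carries a large activation outlier.
tokens = [[0.5, -40.0, 1.0, 0.2], [1.5, 35.0, -0.7, 0.9]]
gain = [1.0, 1.0, 1.0, 1.0]
weight = [[0.2, -0.1, 0.4, 0.3], [-0.5, 0.05, 0.1, 0.6]]

acts = [rms_norm(t, gain) for t in tokens]
scales = channel_scales(acts, weight)
folded_gain, folded_weight = fold_scales(gain, weight, scales)
folded_acts = [rms_norm(t, folded_gain) for t in tokens]

# The folded layer computes the same function in full precision.
before = [linear(weight, a) for a in acts]
after = [linear(folded_weight, a) for a in folded_acts]
max_err = max(abs(b - a) for rb, ra in zip(before, after) for b, a in zip(rb, ra))

# The largest activation magnitude entering the linear layer shrinks.
peak_before = max(abs(v) for a in acts for v in a)
peak_after = max(abs(v) for a in folded_acts for v in a)
```

Because the scale is absorbed into the existing normalization gain, no extra operation is needed at inference time; the outlier magnitude is simply moved into the weight columns, which are typically easier to quantize per channel.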
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy
