RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Arpit Singh Gautam; Saurabh Jha

arXiv:2603.17891 · cs.LG · March 19, 2026

Open Access

TL;DR

RAMP is a reinforcement learning framework that adaptively assigns mixed-precision bit widths to LLM layers, balancing accuracy and efficiency for on-device inference across various models and hardware.

Contribution

The paper introduces an off-policy Soft Actor-Critic method for layer-wise bit-width selection whose policy transfers across models, together with Scale Folding, a preconditioning technique for stable sub-4-bit quantization.

Findings

01

Achieves 5.54 perplexity at 3.68 GB on Llama 2 7B, outperforming uniform-precision quantization methods.

02

The learned quantization policy transfers zero-shot to larger models and to different model families.

03

Retains 99.5% of FP16 reasoning performance in an efficient inference pipeline.

Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget…
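
The abstract describes Scale Folding only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a normalization layer with a per-channel gain feeding a linear layer, and it borrows a SmoothQuant-style rule for choosing the per-channel scales; that rule, the `alpha` parameter, and every function name here are illustrative assumptions and are not taken from the paper. Dividing the gain by a scale and multiplying the matching weight column by the same scale leaves the layer output unchanged while shrinking the activation outlier.

```python
# Hypothetical sketch of folding per-channel scales into a norm gain and the
# following linear layer. All names and the alpha rule are assumptions.

def channel_scales(act_absmax, weight, alpha=0.5, eps=1e-8):
    """Assumed SmoothQuant-style rule: s_j = max|x_j|^alpha / max|W[:, j]|^(1 - alpha)."""
    scales = []
    for j, a in enumerate(act_absmax):
        w = max(abs(row[j]) for row in weight)
        scales.append(max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha))
    return scales

def fold_scales(gamma, weight, scales):
    """Divide the norm gain by s_j and multiply weight column j by s_j."""
    new_gamma = [g / s for g, s in zip(gamma, scales)]
    new_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    return new_gamma, new_weight

def linear_after_norm(x_norm, gamma, weight):
    """Compute y = W @ (gamma * x_norm) for a single token."""
    h = [g * x for g, x in zip(gamma, x_norm)]
    return [sum(w * v for w, v in zip(row, h)) for row in weight]

# Toy example: channel 1 carries an activation outlier through a large gain.
x_norm = [0.5, -1.0, 2.0]
gamma = [1.0, 40.0, 1.0]
weight = [[0.2, 0.01, -0.3], [0.1, -0.02, 0.4]]

act_absmax = [abs(g * x) for g, x in zip(gamma, x_norm)]
scales = channel_scales(act_absmax, weight)
new_gamma, new_weight = fold_scales(gamma, weight, scales)

y_before = linear_after_norm(x_norm, gamma, weight)
y_after = linear_after_norm(x_norm, new_gamma, new_weight)
peak_before = max(act_absmax)
peak_after = max(abs(g * x) for g, x in zip(new_gamma, x_norm))
print(y_before, y_after, peak_before, peak_after)
```

In this toy case the two outputs agree up to floating-point rounding, and the largest activation magnitude entering the linear layer drops well below its original value. The outlier has moved into the weight column, which is the property that makes low-bit activation and weight quantization easier to stabilize.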

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy