A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

Jason Kong; Nilesh Prasad Pandey; Flavio Ponzina; Tajana Rosing

arXiv:2604.13440·cs.LG·April 16, 2026

TL;DR

This paper introduces a KL divergence-based sensitivity analysis method for quantizing hybrid SSM-Transformer models, enabling efficient deployment on edge devices with minimal accuracy loss.

Contribution

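The paper proposes a surrogate-based, backpropagation-free framework that uses KL divergence, computed from forward passes only, to identify the components of hybrid SSM-Transformer models that are most sensitive to quantization, avoiding gradient computation and retraining.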
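A minimal sketch of what a forward-only, KL-based sensitivity pass could look like, assuming a toy stack of linear layers in place of a real hybrid SSM-Transformer. The layer names (`ssm_proj`, `attn_out`, `lm_head`) and the helpers `fake_quantize` and `rank_sensitivity` are hypothetical and are not taken from the paper or its repository; the quantizer is plain symmetric per-tensor rounding, chosen only for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q); assumes q is strictly positive wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def fake_quantize(weights, bits):
    # Assumed scheme: symmetric per-tensor uniform quantization, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
            for row in weights]

def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(layers, x):
    # Toy model: linear layers with tanh in between; the last layer emits logits.
    names = list(layers)
    for name in names[:-1]:
        x = [math.tanh(v) for v in matvec(layers[name], x)]
    return matvec(layers[names[-1]], x)

def rank_sensitivity(layers, calib_inputs, bits=4):
    # Reference output distributions from the unquantized model.
    ref = [softmax(forward(layers, x)) for x in calib_inputs]
    scores = {}
    for name in layers:
        # Quantize one component at a time; no gradients are needed.
        perturbed = dict(layers)
        perturbed[name] = fake_quantize(layers[name], bits)
        outs = [softmax(forward(perturbed, x)) for x in calib_inputs]
        scores[name] = sum(kl_divergence(p, q) for p, q in zip(ref, outs)) / len(ref)
    # Most sensitive component first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy weights and calibration inputs.
layers = {
    "ssm_proj": [[0.9, -0.4], [0.1, 0.7]],
    "attn_out": [[0.5, 0.3], [-0.6, 0.8]],
    "lm_head": [[1.2, -0.7], [0.3, 0.9], [-1.1, 0.4]],
}
calib_inputs = [[1.0, -0.5], [0.2, 0.8], [-0.7, 0.3]]

for name, score in rank_sensitivity(layers, calib_inputs, bits=4):
    print(f"{name}: mean KL = {score:.6f}")
```

In a mixed-precision setting, a ranking of this kind could be used to keep the highest-scoring components at higher precision and quantize the rest more aggressively; how the paper turns its rankings into bit-width assignments is not specified on this page.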

Findings

01. KL divergence captures quantization sensitivity better than MSE and SQNR.

02. KL-based sensitivity rankings align with the performance drops observed in experiments.

03. Mixed-precision quantization achieves near-FP16 perplexity on real hardware.
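To make the three metrics in the first finding concrete, here is a small sketch using their standard definitions. It illustrates one well-known reason an output-distribution metric can disagree with element-wise ones (softmax is invariant to a constant shift in the logits); it is an illustration under that assumption, not the paper's own argument or data. The logit values are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse(ref, test):
    # Mean squared error between two vectors.
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

def sqnr_db(ref, test):
    # Signal-to-quantization-noise ratio in decibels.
    signal = sum(r * r for r in ref)
    noise = sum((r - t) ** 2 for r, t in zip(ref, test))
    return 10.0 * math.log10(signal / noise)

def kl_from_logits(ref_logits, test_logits):
    # KL(softmax(ref) || softmax(test)).
    p, q = softmax(ref_logits), softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

ref = [5.0, 0.0, 0.0, 0.0]          # made-up reference logits
shifted = [6.0, 1.0, 1.0, 1.0]      # every logit moved by +1
flattened = [4.0, 1.0, 1.0, 1.0]    # top logit down, the others up

for label, test in [("shifted", shifted), ("flattened", flattened)]:
    print(label, mse(ref, test), sqnr_db(ref, test), kl_from_logits(ref, test))
```

Both perturbations have the same per-element error magnitude, so MSE and SQNR on the logits cannot tell them apart. The uniform shift leaves the predicted distribution unchanged, while the second perturbation moves probability mass away from the top token, and only the KL term reflects that difference.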

Abstract

Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy…

Code & Models

Repositories

jasonkongie/kl-ssm-quant (GitHub)