TL;DR
This paper introduces a KL divergence-based sensitivity analysis method for quantizing hybrid SSM-Transformer models, enabling efficient deployment on edge devices with minimal accuracy loss.
Contribution
The paper proposes a backpropagation-free, surrogate-based framework that uses KL divergence to identify the quantization-sensitive components of hybrid SSM-Transformer models, without gradient computation or retraining.
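A forward-pass-only sensitivity ranking of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy `forward` model, the symmetric per-tensor `fake_quantize` scheme, the 4-bit setting, and the choice of KL over output token distributions are all hypothetical stand-ins, since the summary does not specify them.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (an assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def forward(weights, x):
    """Toy stand-in for a hybrid model: tanh layers followed by a linear head."""
    h = x
    for w in weights[:-1]:
        h = np.tanh(h @ w)
    return h @ weights[-1]

def mean_kl(ref_logits, test_logits):
    """Mean KL(P_ref || P_test) over the calibration batch."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def rank_components(weights, x, bits=4):
    """Quantize one component at a time and rank by output KL, most sensitive first."""
    ref = forward(weights, x)  # full-precision reference, forward pass only
    scores = {}
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] = fake_quantize(weights[i], bits)
        scores[i] = mean_kl(ref, forward(perturbed, x))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical calibration data and weights, for illustration only.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 16)) for _ in range(3)] + [rng.normal(size=(16, 32))]
calib = rng.normal(size=(64, 16))
ranking = rank_components(weights, calib)
for index, score in ranking:
    print(f"component {index}: mean KL = {score:.4f}")
```

Components at the top of `ranking` would be candidates for higher precision in a mixed-precision plan. No gradients or labels are needed, only a small batch of calibration inputs.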
Findings
KL divergence captures quantization sensitivity better than mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR).
KL-based rankings align with observed performance drops in experiments.
Mixed-precision quantization achieves near-FP16 perplexity on real hardware.
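One way to see how KL divergence can disagree with MSE and SQNR is a small worked comparison. The sketch below assumes the metrics are computed on output logits and that KL is taken between the softmax distributions of the reference and perturbed outputs; the summary does not state the paper's exact definitions, and the example logits are made up for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_divergence(ref_logits, test_logits):
    """Mean KL(P_ref || P_test) between the softmax distributions of the logits."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def mse(ref, test):
    """Mean squared error between raw logits."""
    return float(np.mean((ref - test) ** 2))

def sqnr_db(ref, test):
    """Signal-to-quantization-noise ratio in decibels."""
    noise = np.sum((ref - test) ** 2)
    return float(10 * np.log10(np.sum(ref ** 2) / noise))

ref = np.array([[2.0, 1.0, 0.0, -1.0]])
# Uniform offset: every logit moves by 0.5, so the softmax output is unchanged.
shifted = ref + 0.5
# Same per-element error magnitude, but the relative ordering of logits changes.
reordered = ref + np.array([[-0.5, 0.5, 0.5, -0.5]])

for name, test in [("shifted", shifted), ("reordered", reordered)]:
    print(f"{name:9s} MSE={mse(ref, test):.3f} "
          f"SQNR={sqnr_db(ref, test):.2f} dB KL={kl_divergence(ref, test):.4f}")
```

Both perturbations have identical MSE and SQNR, yet only the second changes the predicted token distribution, which is what KL measures. This is one plausible reason a distribution-level metric could track perplexity degradation more closely; the paper's own argument may differ.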
Abstract
Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy…
