QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design

Nilesh Prasad Pandey; Jangseon Park; Onat Gungor; Flavio Ponzina; Tajana Rosing

arXiv:2601.14549·cs.LG·January 22, 2026

Open Access

TL;DR

QMC pairs outlier-aware quantization with a hybrid memory architecture for small language models, cutting memory footprint, energy consumption, and latency to make deployment on edge devices more practical.

Contribution

The paper presents a novel retraining-free quantization method combined with a heterogeneous memory architecture tailored for SLM inference on edge platforms.

Findings

1. QMC reduces memory usage by up to 7.3x.
2. QMC decreases energy consumption by 11.7x.
3. QMC achieves 12.5x lower latency compared to FP16.

Abstract

Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet it is constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency. SRAM provides fast access but has low density; DRAM must simultaneously accommodate static weights and dynamic KV caches, which creates bandwidth contention; and Flash, although dense, is primarily used for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to LLM inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free quantization method paired with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs,…
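The abstract is cut off before it says how QMC separates or stores inlier and outlier weights, so the following is only a generic illustration of the idea of outlier-aware quantization, not the paper's method. It assumes a simple magnitude-based split (the largest 1% of weights are treated as outliers and kept at full precision) and symmetric 4-bit quantization for the remaining inliers. The function names, `outlier_fraction`, and `bits` are all hypothetical choices made for this sketch.

```python
import random

def split_and_quantize(weights, outlier_fraction=0.01, bits=4):
    """Hypothetical sketch: keep the largest-magnitude weights exact,
    quantize the rest to signed `bits`-bit integer codes."""
    n = len(weights)
    k = max(1, int(n * outlier_fraction))  # number of outliers (assumed rule)
    by_magnitude = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)

    # Outliers: stored at full precision, indexed by position.
    outliers = {i: weights[i] for i in by_magnitude[:k]}

    # Inliers: symmetric uniform quantization with one shared scale.
    qmax = 2 ** (bits - 1) - 1
    max_inlier = max((abs(weights[i]) for i in by_magnitude[k:]), default=0.0)
    scale = max_inlier / qmax if max_inlier > 0 else 1.0
    codes = {i: round(weights[i] / scale) for i in by_magnitude[k:]}
    return codes, scale, outliers

def reconstruct(codes, scale, outliers, n):
    """Rebuild an approximate weight list from codes plus exact outliers."""
    return [outliers[i] if i in outliers else codes[i] * scale for i in range(n)]

# Synthetic weights: mostly small values plus two large outliers.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]
weights[10], weights[500] = 1.5, -2.0

codes, scale, outliers = split_and_quantize(weights)
restored = reconstruct(codes, scale, outliers, len(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the two large values are excluded before the scale is computed, the 4-bit grid spans only the narrow inlier range, and the worst-case reconstruction error stays within half a quantization step. Without the split, the scale would be set by the value -2.0 and most small weights would round to zero.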

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Memory and Neural Computing