Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
An Xuan Nguyen

TL;DR
This paper introduces Asymmetric Virtual Memory Paging (AVMP), a memory management technique for hybrid Mamba-Transformer inference. It improves memory utilization and request throughput by managing the Key-Value cache and the SSM state as two separate pools whose sizes change at runtime.
Contribution
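AVMP places Key-Value (KV) caches and SSM states in two physically separate pools behind a unified virtual address space, and migrates capacity from one pool to the other when an allocation fails. This reduces out-of-memory events and increases throughput in hybrid language model inference.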
Findings
Out-of-memory events reduced by 7.6%
Request throughput increased by up to 13.3x
Gains are statistically significant across workloads
Abstract
Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request…
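The allocation policy described in the abstract (two pools, one virtual page table, migration only on allocation failure) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class names, the page sizes, and the choice to migrate in chunks equal to the least common multiple of the two page sizes are all assumptions made for illustration, and the sketch tracks only capacity accounting, not physical GPU addresses. It needs Python 3.9+ for `math.lcm`.

```python
import math
from dataclasses import dataclass

KV, SSM = "kv", "ssm"


class OutOfMemory(Exception):
    """Raised when the requesting pool is full and the other pool cannot donate."""


@dataclass
class Pool:
    page_bytes: int       # size of one page in this pool (hypothetical values)
    capacity_bytes: int   # bytes currently owned by this pool
    used_pages: int = 0

    @property
    def free_pages(self) -> int:
        return self.capacity_bytes // self.page_bytes - self.used_pages

    @property
    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_pages * self.page_bytes


class DualPoolAllocator:
    """Illustrative two-pool allocator with on-failure capacity migration."""

    def __init__(self, kv_page: int, ssm_page: int, kv_bytes: int, ssm_bytes: int):
        self.pools = {KV: Pool(kv_page, kv_bytes), SSM: Pool(ssm_page, ssm_bytes)}
        # Assumption: migrate in chunks that are a whole number of pages in both pools.
        self.chunk = math.lcm(kv_page, ssm_page)
        self.page_table = {}   # virtual page id -> pool kind
        self.next_vpage = 0
        self.migrations = 0

    def alloc(self, kind: str) -> int:
        pool = self.pools[kind]
        if pool.free_pages == 0:
            # Allocation failure is the only trigger for migration.
            self._migrate(to=kind)
        pool.used_pages += 1
        vpage = self.next_vpage
        self.next_vpage += 1
        self.page_table[vpage] = kind
        return vpage

    def free(self, vpage: int) -> None:
        kind = self.page_table.pop(vpage)
        self.pools[kind].used_pages -= 1

    def _migrate(self, to: str) -> None:
        donor = self.pools[SSM if to == KV else KV]
        if donor.free_bytes < self.chunk:
            raise OutOfMemory(f"cannot grow {to} pool")
        donor.capacity_bytes -= self.chunk
        self.pools[to].capacity_bytes += self.chunk
        self.migrations += 1
```

With toy sizes such as `DualPoolAllocator(kv_page=4, ssm_page=2, kv_bytes=8, ssm_bytes=8)`, the first two KV allocations fit in the KV pool, the third forces one chunk to move from the SSM pool, and allocation fails with `OutOfMemory` only once the SSM pool has nothing left to donate. Because migration never runs speculatively, the same request sequence always produces the same pool sizes.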
