MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

Anurita Das

arXiv:2604.21026·cs.LG·April 27, 2026

TL;DR

MCAP is a load-time profiling method that estimates per-layer importance on the target device and uses that signal to choose each layer's precision and memory tier, improving decode throughput and letting large language models run within constrained memory budgets on heterogeneous hardware.

Contribution

Introduces MCAP (Monte Carlo Activation Profiling), a lightweight per-layer importance estimator that runs at load time and drives both precision dispatch (W4A8 vs. W4A16) and memory placement (GPU, RAM, SSD) in LLM deployment.

Findings

01  The NVE system achieves 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4.

02  Enables models to run in memory regimes that were previously infeasible, without modifying weights.

03  Allows a single set of weights to adapt across diverse memory budgets.

Abstract

Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
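
To make the abstract's idea concrete, the following is a minimal sketch of how a per-layer importance signal could drive both decisions at once. It is not the paper's algorithm: the function `plan_layers`, the `LayerPlan` record, the greedy policy, the `a16_fraction` parameter, and the assumption that more important layers get the faster tier and the higher activation precision (W4A16) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    tier: str       # "gpu", "ram", or "ssd"
    precision: str  # "W4A16" or "W4A8"


def plan_layers(importance, layer_bytes, gpu_budget, ram_budget, a16_fraction=0.25):
    """Hypothetical greedy policy (an assumption, not the paper's method).

    Layers are visited from most to least important. Each layer takes the
    fastest tier that still has room, and the top `a16_fraction` of layers
    by importance keep 16-bit activations (W4A16); the rest use W4A8.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    high_precision = set(order[: int(a16_fraction * len(order))])

    gpu_left, ram_left = gpu_budget, ram_budget
    plans = {}
    for i in order:
        size = layer_bytes[i]
        if size <= gpu_left:
            tier = "gpu"
            gpu_left -= size
        elif size <= ram_left:
            tier = "ram"
            ram_left -= size
        else:
            tier = "ssd"  # assumed to be the unbounded fallback tier
        precision = "W4A16" if i in high_precision else "W4A8"
        plans[i] = LayerPlan(i, tier, precision)

    # Return plans in original layer order.
    return [plans[i] for i in range(len(importance))]


# Toy example: four equally sized layers with made-up importance scores.
plans = plan_layers(
    importance=[0.9, 0.1, 0.5, 0.3],
    layer_bytes=[100, 100, 100, 100],
    gpu_budget=200,
    ram_budget=100,
)
for p in plans:
    print(p)
```

Calling `plan_layers` again with the same importance scores and a different `gpu_budget` or `ram_budget` yields a different plan without touching the weights, which is the property the abstract describes as a single set of weights operating across diverse memory budgets.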
