Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
Andrew Kiruluta

TL;DR
This paper introduces a unified, dynamic framework for compressing large language models by combining compressed sensing and inference-aware structured reduction, enabling adaptive, hardware-efficient model execution.
Contribution
It proposes a novel framework that integrates prompt compression with model reduction, using measurement operators and sparse recovery for adaptive, task-specific model support estimation during decoding.
Findings
Achieves task-conditioned, token-adaptive model support estimation.
Provides formal sample-complexity bounds under certain assumptions.
Enables GPU-efficient sparse execution paths for large language models.
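For orientation, the summary does not state the bounds or their assumptions, so the following is only the classical compressed-sensing result and not necessarily the one proved in the paper. The symbols $n$ (number of candidate units), $k$ (number of active units), $m$ (number of measurements), and the constant $C$ are assumptions introduced here for illustration. If $A \in \mathbb{R}^{m \times n}$ has i.i.d. Gaussian entries, then with high probability every $k$-sparse $x$ can be stably recovered from $y = Ax$ by $\ell_1$ minimization once

```latex
m \;\ge\; C \, k \, \log\!\left(\frac{n}{k}\right)
```

so the number of measurements grows linearly in the sparsity level and only logarithmically in the ambient dimension.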
Abstract
Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets,…
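The measure-then-recover idea in the abstract can be illustrated with a minimal sketch. This is generic compressed sensing, not the author's method: it assumes a Gaussian measurement matrix, uses Orthogonal Matching Pursuit as a stand-in recovery algorithm, and treats the sparse vector `x` as a hypothetical "usage" score over `n` candidate model units (for example heads or channels). All names and sizes are illustrative.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedy k-sparse estimate of x from y = A @ x."""
    n = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Score each column by its correlation with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect a chosen column
        support.append(int(np.argmax(correlations)))
        # Refit all selected coefficients by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat, sorted(support)


# Hypothetical sizes: n candidate units, m random measurements, k active units.
rng = np.random.default_rng(0)
n, m, k = 256, 96, 5

# Ground-truth sparse usage vector (assumption: only k units matter for this prompt).
true_support = sorted(rng.choice(n, size=k, replace=False).tolist())
x = np.zeros(n)
x[true_support] = rng.uniform(1.0, 2.0, size=k)

# Random measurement operator and noiseless measurements.
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x

x_hat, est_support = omp(A, y, k)
print("true support:     ", true_support)
print("estimated support:", est_support)
```

In a dynamic-execution setting, an estimated support like `est_support` would be the set of units kept active for the current prompt or decoding step, with the rest skipped. How the paper constructs its measurements and maps supports to GPU execution paths is not specified in this summary.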
