TL;DR
RAP is a reinforcement learning-based framework that adaptively prunes large language models during inference to optimize memory usage and performance in real-time scenarios.
Contribution
It introduces a dynamic, runtime-aware pruning method that jointly considers model weights and KV-cache, outperforming fixed heuristic approaches.
Findings
RAP outperforms state-of-the-art baselines in experiments.
It effectively adapts to runtime memory variations and workload demands.
It is the first method to jointly consider model weights and KV-cache dynamically.
Abstract
Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within…
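The ratio between parameter memory and KV-cache memory that the abstract refers to can be illustrated with standard back-of-the-envelope accounting. The sketch below is not taken from the paper: the `ModelShape` fields and the helper names are hypothetical, and the formula is the usual dense-attention estimate (keys plus values, per layer, per token), assuming FP16 storage and ignoring activations and allocator overhead.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    # All fields are illustrative assumptions, not values from the paper.
    n_params: int            # total number of weights
    n_layers: int            # transformer layers
    n_kv_heads: int          # key/value heads per layer
    head_dim: int            # dimension of each head
    bytes_per_elem: int = 2  # FP16 storage


def weight_bytes(m: ModelShape) -> int:
    """Memory taken by the weights alone."""
    return m.n_params * m.bytes_per_elem


def kv_cache_bytes(m: ModelShape, batch_size: int, seq_len: int) -> int:
    """Memory taken by cached keys and values (the factor 2 covers K and V)."""
    per_token = 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem
    return per_token * batch_size * seq_len


def kv_share(m: ModelShape, batch_size: int, seq_len: int) -> float:
    """Fraction of (weights + KV-cache) memory that the KV-cache occupies."""
    kv = kv_cache_bytes(m, batch_size, seq_len)
    return kv / (kv + weight_bytes(m))


if __name__ == "__main__":
    # A hypothetical 7B-class shape, chosen only for illustration.
    shape = ModelShape(n_params=7_000_000_000, n_layers=32, n_kv_heads=32, head_dim=128)
    for batch, seq in [(1, 512), (16, 4096)]:
        print(batch, seq, round(kv_share(shape, batch, seq), 3))
```

Under these assumptions the KV-cache is a small fraction of memory for a single short request and the majority of memory for a large batch of long requests, which is the kind of shift a runtime-aware method would need to track.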
Peer Reviews
Decision·Submitted to ICLR 2026
* Clearly formulates the problem of runtime-aware pruning in realistic LLM inference settings (varying input length, batch size, and memory constraints) and convincingly highlights the limitations of static pruning approaches.
* Introduces Greedy Sequential Importance (GSI) as a principled mechanism that accounts for inter-layer dependencies through sequential importance re-evaluation, effectively mitigating performance degradation associated with one-shot pruning.
* Employs an RL-based controll
* The construction of the calibration set used for GSI computation in Table 1 is insufficiently specified, making it difficult to rule out the possibility that benchmark test data distributions were indirectly utilized during pruning, raising concerns regarding the fairness of the evaluation protocol.
* Since GSI is repeatedly recomputed based on a proxy corpus and sampled request distributions, the stability and reproducibility of the resulting importance scores remain unclear, potentially intr
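The contrast the review draws between sequential importance re-evaluation and one-shot pruning can be shown on a toy problem. The sketch below is a generic greedy procedure written for illustration only; it is not the paper's GSI algorithm, and `greedy_sequential_prune`, `toy_loss`, and the block names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_sequential_prune(
    block_bytes: Dict[str, int],
    loss_fn: Callable[[FrozenSet[str]], float],
    budget_bytes: int,
) -> List[str]:
    """Remove blocks one at a time until the kept blocks fit the budget.

    Returns the removal order.
    """
    kept = set(block_bytes)
    removed: List[str] = []
    while kept and sum(block_bytes[b] for b in kept) > budget_bytes:
        # Re-score every remaining block against the current kept set,
        # instead of reusing scores computed once up front.
        best = min(sorted(kept), key=lambda b: loss_fn(frozenset(kept - {b})))
        kept.remove(best)
        removed.append(best)
    return removed


def toy_loss(kept: FrozenSet[str]) -> float:
    """Toy loss: blocks 'a' and 'b' are redundant with each other."""
    loss = 0.0
    if "a" not in kept and "b" not in kept:
        loss += 10.0  # losing both redundant blocks is expensive
    elif "a" not in kept or "b" not in kept:
        loss += 1.0   # losing just one of them is cheap
    if "c" not in kept:
        loss += 3.0
    return loss


if __name__ == "__main__":
    sizes = {"a": 1, "b": 1, "c": 1}
    print(greedy_sequential_prune(sizes, toy_loss, budget_bytes=1))  # ['a', 'c']
```

In this toy setting a one-shot ranking would score `a` and `b` as the two cheapest blocks and drop both, for a loss of 10. Re-scoring after removing `a` reveals that `b` has become important, so the greedy procedure drops `c` instead and ends with a loss of 4.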
- The shift from evaluating at a "fixed sparsity ratio" to a "fixed memory budget" is interesting. It more accurately reflects the deployment constraints on resource-limited devices.
- The results clearly show that RAP makes more intelligent pruning decisions than static baselines, especially under aggressive memory budgets.
- The paper includes thorough ablation studies that demonstrate the necessity of both the GSI component and the RL agent.
- The entire motivation is built on optimizing for a memory budget. However, the paper fails to compare against the most effective and widely adopted technique, post-training quantization (e.g., INT4). A simple INT4 quantized model would occupy a smaller memory footprint than RAP's pruned FP16 model under the same budget.
- In real-world applications, FP16 would not be deployed. A convincing demonstration of RAP's value would be to show that it can further reduce memory on top of a quantized model.
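The reviewer's footprint argument rests on simple weight-memory arithmetic, sketched below. The helper names are hypothetical, and the calculation is a simplification: it counts weights only and ignores quantization metadata (scales, zero-points) and the KV-cache.

```python
def weight_gib(n_params: int, bits_per_param: float) -> float:
    """Weight memory in GiB for a given storage width."""
    return n_params * bits_per_param / 8 / 2**30


def sparsity_to_match(bits_dense: float, bits_quant: float) -> float:
    """Fraction of weights a dense-format model must prune to reach
    the same weight footprint as the quantized model."""
    return 1.0 - bits_quant / bits_dense


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical parameter count, for illustration only
    print(round(weight_gib(n, 16), 2), "GiB at FP16")
    print(round(weight_gib(n, 4), 2), "GiB at INT4")
    print(sparsity_to_match(16, 4), "sparsity needed at FP16 to match INT4")
```

Under these simplifying assumptions, an FP16 model would need 75% of its weights removed to match the weight footprint of an unpruned INT4 model, which is why the reviewer asks for a comparison on top of quantization.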
1. Targeting both parameters and the KV cache is novel, as most pruning work optimizes only weights.
2. The MLP-based design has small overhead, making it easy and efficient to deploy.
3. The ablation study shows the effectiveness of this method.
1. Only zero-shot short-answer benchmarks are evaluated. Long-context tasks (where KV matters) or real generation quality would better show the purported advantage.
2. An end-to-end latency and throughput comparison is needed. Real-world servers also care about tokens/sec and tail latency besides memory savings.
3. If heads or layers are dropped at runtime, how are pre-existing KV tensors handled across decoding steps?
4. If GSI already orders blocks and the agent “iteratively removes the least important,” wh
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning
