Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models
Tianhao Qian

TL;DR
This paper introduces RKU, a novel framework for structural pruning in large language models that preserves reasoning capabilities at high sparsity levels by focusing on critical structural pathways.
Contribution
The paper proposes RKU, a continuous kinetic integral approach with Fisher normalization, to improve reasoning accuracy in pruned LLMs at high sparsity, surpassing existing methods.
Findings
RKU improves reasoning accuracy at 40% sparsity on GSM8K.
RKU outperforms baseline methods in high-sparsity regimes.
RKU better preserves reasoning representations under out-of-distribution tests.
Abstract
Chain-of-Thought (CoT) prompting has markedly improved the reasoning capabilities of Large Language Models (LLMs). However, scaling up test-time computation yields extensive CoT sequences, introducing severe inference latency and key-value (KV) cache memory bottlenecks. While structural pruning offers a fundamental, hardware-aware way to reduce the static parameter burden, existing magnitude-based methods may sever the neurons that support CoT reasoning. By over-indexing on discrete cross-entropy objectives, these heuristics fall into a "magnitude trap": they prioritize high-frequency, low-information syntactic tokens and trigger a reasoning collapse at high sparsities (e.g., 40%). To overcome this topological phase transition, we propose Relative Kinetic Utility (RKU), a novel theoretical framework that elevates discrete pruning to a continuous kinetic integral…
