LLMs can Compress LLMs: Adaptive Pruning by Agents
Sai Varun Kodathala, Rakesh Vunnam

TL;DR
This paper presents an adaptive, agent-guided pruning method for large language models that intelligently preserves critical knowledge pathways, significantly improving performance and knowledge retention at high sparsity levels without retraining.
Contribution
It introduces a novel agent-based pruning framework that combines sensitivity profiling with self-reflection, enabling effective, model-agnostic compression of LLMs while maintaining performance.
Findings
56% relative improvement in MMLU accuracy
19x better factual knowledge retention on FreebaseQA
69% lower perplexity degradation
Abstract
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired…
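The activation-aware magnitude pruning that the abstract attributes to Wanda can be sketched in a few lines. This is a minimal illustration of the Wanda importance score, |W_ij| multiplied by the L2 norm of input feature j over a calibration set, and not the authors' agent-guided method. The function names, the toy matrices, and the `sparsity` argument are hypothetical. In the setting the abstract describes, a pruning agent rather than a fixed constant would decide how aggressively each layer is pruned.

```python
import math

def wanda_scores(weight, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2 (Wanda-style importance)."""
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(row[j] ** 2 for row in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]

def prune_layer(weight, activations, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    scores = wanda_scores(weight, activations)
    pruned = []
    for w_row, s_row in zip(weight, scores):
        k = int(len(w_row) * sparsity)  # number of weights to drop in this row
        drop = set(sorted(range(len(w_row)), key=lambda j: s_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Toy layer (assumed data): 2 outputs, 4 inputs, 3 calibration samples.
W = [[0.5, -0.1, 0.3, 2.0],
     [1.0, 0.2, -0.4, 0.05]]
X = [[1.0, 10.0, 0.1, 0.0],
     [2.0, 10.0, 0.1, 0.0],
     [2.0, 10.0, 0.1, 0.1]]

# Hypothetical per-layer ratio; a uniform heuristic would reuse it for every layer.
pruned = prune_layer(W, X, sparsity=0.5)
print(pruned)
```

In this toy layer the largest-magnitude weight (2.0) is removed because its input feature is almost never active, while the small weight -0.1 survives because its input is large. That difference from plain magnitude pruning is what "activation-aware" refers to.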
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
