TL;DR
This paper investigates how width pruning of GLU-MLP layers in Llama-3.2 models affects capabilities selectively: factual knowledge degrades while instruction-following improves, which challenges the assumption that pruning degrades all capabilities uniformly.
Contribution
The paper systematically characterizes selective capability preservation under width pruning, relating knowledge degradation to improved behavioral alignment and to efficiency trade-offs.
Findings
Pruning improves instruction-following performance (+46% to +75% on IFEval for Llama-3.2-1B and 3B models)
Performance on parametric-knowledge tasks (e.g., MMLU, GSM8K) and perplexity degrade predictably
Pruned models achieve up to 23% energy reduction
Abstract
Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive…
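The abstract describes width pruning of a GLU-MLP block toward a smaller expansion ratio, with neurons ranked by the Maximum Absolute Weight (MAW) criterion. The following is a minimal sketch of that idea, not the authors' implementation. The abstract does not give the exact MAW formula, so the scoring rule here (a neuron's score is the largest absolute weight in its gate and up projection rows) is an assumption, as are the function names `maw_importance`, `prune_glu_mlp`, and `glu_mlp`, the use of plain NumPy arrays in place of real Llama-3.2 weights, and the definition of expansion ratio as intermediate width divided by hidden width.

```python
import numpy as np


def maw_importance(gate_w, up_w):
    """Score each intermediate neuron by its largest absolute weight.

    ASSUMPTION: this is one plausible reading of "Maximum Absolute Weight";
    the source does not spell out the formula.
    gate_w, up_w have shape (intermediate, hidden), one row per neuron.
    """
    return np.maximum(np.abs(gate_w).max(axis=1), np.abs(up_w).max(axis=1))


def prune_glu_mlp(gate_w, up_w, down_w, target_ratio):
    """Keep the highest-scoring neurons so that intermediate/hidden == target_ratio.

    down_w has shape (hidden, intermediate), so its columns are pruned.
    """
    intermediate, hidden = gate_w.shape
    keep_n = int(round(target_ratio * hidden))
    if not 1 <= keep_n <= intermediate:
        raise ValueError("target_ratio must yield between 1 and `intermediate` neurons")
    scores = maw_importance(gate_w, up_w)
    # Indices of the keep_n largest scores, restored to their original order.
    keep = np.sort(np.argsort(scores)[-keep_n:])
    return gate_w[keep], up_w[keep], down_w[:, keep]


def glu_mlp(x, gate_w, up_w, down_w):
    """Forward pass of a SiLU-gated MLP: down(silu(gate(x)) * up(x))."""
    gate = x @ gate_w.T
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ up_w.T)) @ down_w.T


# Toy example: hidden width 8, intermediate width 32 (expansion ratio 4.0),
# pruned to expansion ratio 2.0, i.e. 16 intermediate neurons.
rng = np.random.default_rng(0)
demo_gate = rng.normal(size=(32, 8))
demo_up = rng.normal(size=(32, 8))
demo_down = rng.normal(size=(8, 32))
pruned_gate, pruned_up, pruned_down = prune_glu_mlp(demo_gate, demo_up, demo_down, 2.0)
print(pruned_gate.shape, pruned_up.shape, pruned_down.shape)  # (16, 8) (16, 8) (8, 16)
```

Because the same neuron indices are removed from all three projections, the pruned block remains a valid GLU-MLP with a narrower intermediate width. The paper's reported effects on benchmarks depend on the real models and criterion, which this toy example does not reproduce.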
