PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

Bowei He; Lihao Yin; Hui-Ling Zhen; Xiaokun Zhang; Mingxuan Yuan; Chen Ma

arXiv:2502.12594 · cs.CL · February 13, 2026

Open Access

TL;DR

PASER is a post-training data selection method for recovering pruned large language models. It selects the instruction data that targets the most degraded capabilities, which reduces the amount of recovery data needed and filters out instructions that would hinder recovery.

Contribution

PASER introduces a capability-aware data selection framework using manifold learning and spectral clustering to improve model recovery efficiency after pruning.

Findings

01. Outperforms baseline methods in recovering model capabilities.

02. Achieves effective recovery with only 4-20% of the original data.

03. Reduces negative effects by filtering out irrelevant data.

Abstract

Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery…
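
The idea of spending a fixed data budget on the most compromised capabilities can be illustrated with a small sketch. This is not the authors' algorithm, whose details are cut off in the abstract above. It is a minimal illustration that assumes the instructions are already grouped into capability clusters, that each cluster has a non-negative degradation score, and that the budget is split in proportion to those scores. The names `allocate_budget` and `select_instructions`, the cluster labels, and the proportional rule are all hypothetical.

```python
from typing import Dict, List


def allocate_budget(degradation: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` samples across clusters in proportion to degradation.

    Assumption (not from the paper): scores are non-negative and a higher
    score means the capability was hurt more by pruning.
    """
    total = sum(degradation.values())
    if total <= 0 or budget <= 0:
        return {name: 0 for name in degradation}
    # Exact proportional share for each cluster, then round down.
    raw = {name: budget * score / total for name, score in degradation.items()}
    alloc = {name: int(share) for name, share in raw.items()}
    # Give the samples lost to rounding to the largest fractional remainders.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda n: raw[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def select_instructions(
    clusters: Dict[str, List[str]],
    degradation: Dict[str, float],
    budget: int,
) -> List[str]:
    """Take each cluster's quota from the front of its instruction list.

    Assumption: each list is already ordered by some usefulness criterion.
    A cluster smaller than its quota simply contributes everything it has.
    """
    alloc = allocate_budget(degradation, budget)
    selected: List[str] = []
    for name, quota in alloc.items():
        selected.extend(clusters[name][:quota])
    return selected


if __name__ == "__main__":
    # Hypothetical clusters and scores, for illustration only.
    clusters = {
        "math": ["m1", "m2", "m3", "m4"],
        "code": ["c1", "c2", "c3"],
        "chat": ["h1", "h2"],
    }
    degradation = {"math": 6, "code": 3, "chat": 1}
    print(select_instructions(clusters, degradation, budget=5))
```

Under these assumptions, clusters with higher degradation scores receive more of the budget, and a cluster with a score of zero receives none, which is one simple way to keep irrelevant instructions out of the recovery set.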

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis