Beware of Calibration Data for Pruning Large Language Models

Yixin Ji; Yang Xiang; Juntao Li; Qingrong Xia; Ping Li; Xinyu Duan; Zhefeng Wang; Min Zhang

arXiv:2410.17711·cs.CL·July 1, 2025

Open Access·3 Reviews

TL;DR

This paper investigates the impact of calibration data on post-training pruning of large language models, revealing that data similarity and size significantly affect pruning effectiveness, and proposes a self-generating data strategy to improve results.

Contribution

It systematically explores calibration data effects on pruning, introduces a self-generation strategy, and demonstrates substantial performance improvements on recent LLMs.

Findings

01 A small amount of calibration data that resembles the model's training data suffices for effective pruning.

02 Calibration data quality significantly influences pruning performance.

03 Self-generated calibration data enhances pruning results substantially.
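
A minimal sketch of what self-generated calibration data could look like, under loud assumptions: `ToyBigramLM` is a hypothetical stand-in for the LLM being pruned, and the rule of keeping the lowest-perplexity samples is an illustrative choice, not a detail confirmed on this page. The idea shown is only that the model writes its own calibration text from short prefixes, and a sampling step then selects which generations to keep.

```python
import math
import random


class ToyBigramLM:
    """Hypothetical stand-in for the model being pruned (not a real LLM API)."""

    def __init__(self, corpus_tokens):
        self.vocab = sorted(set(corpus_tokens))
        self.counts = {}
        for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
            row = self.counts.setdefault(prev, {})
            row[nxt] = row.get(nxt, 0) + 1

    def prob(self, prev, nxt):
        # Add-one smoothing keeps unseen pairs at a non-zero probability.
        row = self.counts.get(prev, {})
        return (row.get(nxt, 0) + 1) / (sum(row.values()) + len(self.vocab))

    def generate(self, prefix, length, rng):
        # Sample `length` new tokens, each conditioned on the previous token.
        tokens = list(prefix)
        for _ in range(length):
            weights = [self.prob(tokens[-1], t) for t in self.vocab]
            tokens.append(rng.choices(self.vocab, weights=weights)[0])
        return tokens

    def perplexity(self, tokens):
        # Perplexity of the sequence under this same model.
        log_prob = sum(math.log(self.prob(p, n))
                       for p, n in zip(tokens, tokens[1:]))
        return math.exp(-log_prob / (len(tokens) - 1))


def self_generate_then_sample(model, prefixes, gen_len, keep_ratio, seed=0):
    """Generate one continuation per prefix, then keep the lowest-perplexity share.

    Assumption: perplexity filtering is used here only as an example sampling rule.
    """
    rng = random.Random(seed)
    samples = [model.generate(p, gen_len, rng) for p in prefixes]
    samples.sort(key=model.perplexity)
    n_keep = max(1, int(len(samples) * keep_ratio))
    return samples[:n_keep]


corpus = ("the model prunes weights the model keeps weights "
          "the data guides pruning").split()
model = ToyBigramLM(corpus)
prefixes = [["the"], ["model"], ["data"], ["weights"]]
calib = self_generate_then_sample(model, prefixes, gen_len=6, keep_ratio=0.5)
```

With four prefixes and `keep_ratio=0.5`, `calib` holds the two generated sequences the toy model itself finds most likely; in a real setting these would be fed to the pruning method as calibration inputs.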

Abstract

As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not require resource-intensive iterative training and only needs a small amount of calibration data to assess the importance of parameters. Recent research has enhanced post-training pruning from different aspects but few of them systematically explore the effects of calibration data, and it is unclear if there exist better calibration data construction strategies. We fill this blank and surprisingly observe that calibration data is also crucial to post-training pruning, especially for high sparsity. Through controlled experiments on important influence factors of calibration data, including the pruning settings, the amount of data, and its similarity with…
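
To make the role of calibration data concrete, here is a minimal sketch of how a post-training pruning method can use a few calibration inputs to score parameter importance. The scoring rule (weight magnitude times the L2 norm of the matching input feature, in the style of Wanda) is an assumption chosen for illustration; the abstract does not specify a metric, and the function names, weights, and inputs below are all hypothetical.

```python
import math


def column_norms(calib_inputs):
    """L2 norm of each input feature across the calibration samples."""
    n_features = len(calib_inputs[0])
    return [math.sqrt(sum(row[j] ** 2 for row in calib_inputs))
            for j in range(n_features)]


def prune_row(weights, norms, sparsity):
    """Zero the lowest-scoring weights in one output row.

    score_j = |w_j| * norm_j  (Wanda-style metric, assumed for illustration).
    """
    scores = [abs(w) * n for w, n in zip(weights, norms)]
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending score; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda j: scores[j])
    drop = set(order[:n_prune])
    return [0.0 if j in drop else w for j, w in enumerate(weights)]


def prune_layer(weight_matrix, calib_inputs, sparsity=0.5):
    """Prune every output row of a linear layer to the given sparsity."""
    norms = column_norms(calib_inputs)
    return [prune_row(row, norms, sparsity) for row in weight_matrix]


# Toy 2x4 linear layer and two made-up calibration sets.
W = [[0.9, -0.2, 0.5, 0.1],
     [0.3, 0.8, -0.4, 0.6]]
calib_a = [[1.0, 0.1, 0.1, 3.0], [1.0, 0.2, 0.1, 3.0]]  # features 0 and 3 active
calib_b = [[0.1, 3.0, 1.0, 0.1], [0.1, 3.0, 1.0, 0.2]]  # features 1 and 2 active

pruned_a = prune_layer(W, calib_a)
pruned_b = prune_layer(W, calib_b)
print(pruned_a)  # [[0.9, 0.0, 0.0, 0.1], [0.3, 0.0, 0.0, 0.6]]
print(pruned_b)  # [[0.0, -0.2, 0.5, 0.0], [0.0, 0.8, -0.4, 0.0]]
```

The same weights at the same 50% sparsity end up with entirely different masks depending on which calibration set is used, which is the kind of sensitivity the abstract warns about.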

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

The paper effectively challenges the common assumption that post-training pruning methods are robust to the choice of calibration data. Recognizing the challenge of inaccessible training data, the paper introduces a "self-generating then sampling" strategy for constructing suitable calibration data. The paper provides a detailed examination of various aspects related to the self-generating calibration data strategy.

Weaknesses

While the paper shows a correlation between training data similarity and pruning performance, it doesn't explain why this connection exists. The paper's evaluation primarily centers on overall model performance. Investigating how calibration data affects the pruning of individual model components like attention heads or specific layers could be beneficial. This granular analysis would offer a more complete picture of how calibration data impacts different parts of the LLM.

Reviewer 02·Rating 5·Confidence 3

Strengths

The paper productively expands on prior work to answer open follow-up questions about the influence of calibration data on pruning and delivers insightful findings through a set of reliable experiments. It proposes a novel and intuitive approach for the synthesis of calibration data and evaluates it empirically and theoretically while experimentally justifying major hyperparameter choices. They show that the approach can improve by up to 2.6% over using an out-of-distribution calibrat

Weaknesses

The main results are not well presented. In Table 2, the proposed calibration data synthesis approach frequently falls behind other sources of calibration data. The table does not highlight (e.g., using colors or otherwise) whether each source was present in the training set of the evaluated LLM. That is, it makes sense to have separate comparisons for the proposed approach with each of (i) data the model was not trained on and (ii) data the model was trained on, but these seem to be

Reviewer 03·Rating 8·Confidence 5

Strengths

1. This paper introduces a criterion and construction strategy for choosing calibration data in post-training pruning, supported by extensive experimental validation.

2. The authors conduct experiments on various LLMs and pruning methods, with multiple repetitions, to eliminate the effects of randomness.

3. The paper is well-organized, clearly presenting the empirical studies, methodology, experiments, and results, making it easy for readers to follow the authors' arguments.

Weaknesses

1. This paper only conducts experiments on unstructured and semi-structured pruning settings and does not validate the effectiveness of synthetic calibration data in more practical structured pruning.

2. The synthetic calibration data is not a method first proposed by the authors. A recent work by Shin et al.[1] also proposed synthetic calibration data. However, the authors do not discuss the differences between that work and the others.

3. This paper only uses data from Wikipedia to generate sy

Taxonomy

Topics·Natural Language Processing Techniques

Methods·Pruning