Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning
Shuhe Wang, Guoyin Wang, Yizhong Wang, Jiwei Li, Eduard Hovy, Chen Guo

TL;DR
This paper provides a comprehensive analysis of packing versus padding in supervised fine-tuning, evaluating efficiency, performance, and practical considerations across various models and datasets.
Contribution
It offers the first extensive comparison of packing and padding in SFT, including benchmarks, practical insights, and open-source tools for future research.
Findings
Packing is more suitable for large models and datasets in SFT.
Packing can improve training efficiency without sacrificing performance.
The effectiveness of packing depends on model size and dataset characteristics.
Abstract
Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model's maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M samples and models from 8B to 70B parameters. This…
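To make the padding/packing distinction in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `pad_batch` and `pack_sequences`, the token ids `pad_id=0` and `eos_id=1`, and the simple in-order greedy strategy are all hypothetical choices made for this example.

```python
from typing import List


def pad_batch(seqs: List[List[int]], max_len: int, pad_id: int = 0) -> List[List[int]]:
    """Padding: one sample per row, filled with pad_id up to max_len."""
    rows = []
    for s in seqs:
        s = s[:max_len]  # truncate over-long samples
        rows.append(s + [pad_id] * (max_len - len(s)))
    return rows


def pack_sequences(
    seqs: List[List[int]], max_len: int, eos_id: int = 1, pad_id: int = 0
) -> List[List[int]]:
    """Packing: concatenate samples, in order, into rows of length max_len.

    Each sample is terminated with eos_id (an assumed separator). A sample
    that does not fit in the current row starts a new row; the leftover
    space in the old row is padded.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    for s in seqs:
        item = s[: max_len - 1] + [eos_id]  # leave room for the separator
        if len(current) + len(item) > max_len:
            rows.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current.extend(item)
    if current:
        rows.append(current + [pad_id] * (max_len - len(current)))
    return rows


def pad_fraction(rows: List[List[int]], pad_id: int = 0) -> float:
    """Share of positions occupied by padding (assumes pad_id is not a real token)."""
    total = sum(len(r) for r in rows)
    return sum(r.count(pad_id) for r in rows) / total if total else 0.0
```

With four short samples and `max_len=8`, padding produces four rows while this sketch packs them into two, so far fewer positions are spent on pad tokens. The sketch deliberately leaves out attention masking and position handling across sample boundaries, which bears on point (3) of the abstract: whether packed samples can see each other as context.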
Taxonomy
Topics: Optimization and Packing Problems · VLSI and FPGA Design Techniques
Methods: Shrink and Fine-Tune
