Improving Continual Pre-training Through Seamless Data Packing
Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang

TL;DR
This paper introduces Seamless Packing, a novel data packing strategy for continual pre-training that improves contextual coherence and model performance by using a sliding window and optimized packing algorithms.
Contribution
We propose Seamless Packing, a new data packing method that enhances continual pre-training by preserving context and reducing truncation, outperforming baseline methods.
Findings
Outperforms baseline in 99% of settings
Improves contextual coherence during pre-training
Reduces truncation and padding issues
Abstract
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences,…
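To make the contrast in the abstract concrete, here is a minimal Python sketch of the two ideas it names: the common concatenate-and-split baseline, and a generic sliding window in which consecutive sequences share some tokens. This is an illustration under stated assumptions, not the authors' implementation. The function names, the `eos_id` separator, and the fixed `overlap` parameter are all hypothetical; the abstract does not specify how Seamless Packing chooses the overlap, and the later stages of the method are not shown.

```python
from typing import List


def concat_and_chunk(docs: List[List[int]], seq_len: int, eos_id: int = 0) -> List[List[int]]:
    """Baseline packing: join all documents into one token stream
    (with an assumed EOS separator) and cut it every seq_len tokens."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    # Keep only full-length chunks; the trailing partial chunk is dropped here.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]


def sliding_window_chunks(doc: List[int], seq_len: int, overlap: int) -> List[List[int]]:
    """Generic sliding window over one document: each window has seq_len tokens
    and shares `overlap` tokens with the previous window (assumed fixed here)."""
    if not 0 <= overlap < seq_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seq_len")
    stride = seq_len - overlap
    chunks: List[List[int]] = []
    start = 0
    while start + seq_len <= len(doc):
        chunks.append(doc[start:start + seq_len])
        start += stride
    return chunks


# Baseline: the first document is cut mid-way, and its tail shares a
# sequence with the start of an unrelated document.
baseline = concat_and_chunk([[1, 2, 3, 4, 5], [6, 7, 8]], seq_len=4)

# Sliding window: every boundary token also appears with its neighbours
# in an adjacent sequence.
windows = sliding_window_chunks(list(range(10)), seq_len=4, overlap=2)
```

In the baseline output the token `5` is separated from the rest of its document, which is the kind of context discontinuity the abstract describes. The overlapping windows avoid that at the cost of repeating tokens, so the amount of overlap is a trade-off between context preservation and training-data redundancy.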
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: ADaptive gradient method with the OPTimal convergence rate
