Improving Continual Pre-training Through Seamless Data Packing

Ruicheng Yin; Xuan Gao; Changze Lv; Xiaohua Wang; Xiaoqing Zheng; Xuanjing Huang

arXiv:2505.22018·cs.CL·May 30, 2025

TL;DR

This paper introduces Seamless Packing, a novel data packing strategy for continual pre-training that improves contextual coherence and model performance by using a sliding window and optimized packing algorithms.

Contribution

We propose Seamless Packing, a new data packing method that enhances continual pre-training by preserving context and reducing truncation, outperforming baseline methods.

Findings

01 Outperforms baseline in 99% of settings

02 Improves contextual coherence during pre-training

03 Reduces truncation and padding issues

Abstract

Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences,…
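
To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `eos_id` separator, and the `overlap` parameter are assumptions made for illustration. `concat_and_chunk` shows the common concatenate-and-split baseline, and `sliding_window_chunk` shows a generic sliding window in which consecutive sequences share a fixed number of tokens.

```python
from typing import List


def concat_and_chunk(docs: List[List[int]], seq_len: int, eos_id: int = 0) -> List[List[int]]:
    """Baseline: join all documents into one token stream, then cut fixed-length pieces.

    `eos_id` is a hypothetical separator token id appended after each document.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    # Cuts fall wherever the length boundary lands, so documents can be split
    # mid-context; the trailing remainder shorter than seq_len is dropped.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]


def sliding_window_chunk(doc: List[int], seq_len: int, overlap: int) -> List[List[int]]:
    """Generic sliding window: consecutive windows share `overlap` tokens.

    This is an illustrative assumption about how overlap could be applied,
    not the procedure described in the paper.
    """
    if not 0 <= overlap < seq_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seq_len")
    stride = seq_len - overlap
    windows: List[List[int]] = []
    start = 0
    while start + seq_len <= len(doc):
        windows.append(doc[start:start + seq_len])
        start += stride
    # Tokens after the last full window are left out of this sketch.
    return windows


baseline = concat_and_chunk([[1, 2, 3], [4, 5, 6, 7, 8]], seq_len=4)
windows = sliding_window_chunk(list(range(10)), seq_len=4, overlap=2)
```

In the baseline call, the second document is cut after token 7 and its last token is discarded with the remainder, which is the kind of truncation the abstract refers to. In the sliding-window call, each window repeats the final two tokens of the previous one, so no sequence starts without preceding context.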

Code & Models

Repositories

infernus-wind/seamless-packing (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: ADaptive gradient method with the OPTimal convergence rate