Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

Yanbing Chen; Ruilin Wang; Zihao Yang; Lavender Yao Jiang; Eric Karl Oermann

arXiv:2408.09621 · cs.CL · August 20, 2024

TL;DR

This paper compares two packing strategies for training language models, concatenation and padding. It finds that matching atom size to the maximum sequence length (MSL) optimizes performance for both, and that padding yields lower perplexity than concatenation at the cost of more training steps and lower compute efficiency.

Contribution

It systematically evaluates the impact of atom size and packing strategies on language model training performance and efficiency.

Findings

01. Matching atom size to the maximum sequence length (MSL) optimizes model performance.

02. Padding yields lower perplexity than concatenation.

03. Padding requires more training steps and reduces compute efficiency.

Abstract

Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated into chunks of maximum sequence length (MSL) and then shuffled. However, setting the atom size (the length of each data chunk that is randomly shuffled) to MSL may lead to contextual incoherence, because tokens from different documents are packed into the same chunk. An alternative approach is to use padding, another common data packing strategy, which avoids contextual incoherence by including only one document in each shuffled chunk. To optimize both packing strategies (concatenation vs. padding), we investigated the optimal atom size for shuffling and compared their performance and efficiency. We found that matching atom size to MSL optimizes performance for both packing methods (concatenation and…
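
As a rough illustration of the two packing strategies and of atom-level shuffling, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `PAD` token id, the choice to drop a trailing partial chunk, and the exact shuffle procedure (cut the packed stream into atoms, shuffle the atoms, re-cut into MSL-sized chunks) are all assumptions made for illustration.

```python
import random

PAD = 0  # hypothetical padding token id (assumption)


def pack_concat(docs, msl):
    """Concatenate all documents into one token stream, then cut it into
    chunks of exactly `msl` tokens. A chunk may span several documents.
    The trailing partial chunk is dropped (an assumption of this sketch)."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + msl] for i in range(0, len(stream) - msl + 1, msl)]


def pack_padding(docs, msl):
    """Place one document per chunk: truncate to `msl` tokens and fill the
    remainder with PAD, so no chunk mixes tokens from different documents."""
    chunks = []
    for doc in docs:
        head = doc[:msl]
        chunks.append(head + [PAD] * (msl - len(head)))
    return chunks


def shuffle_atoms(chunks, msl, atom_size, seed=0):
    """Cut the packed tokens into atoms of `atom_size`, shuffle the atoms,
    and re-cut the result into chunks of `msl` tokens. With
    atom_size == msl, whole chunks are shuffled and stay intact."""
    stream = [tok for chunk in chunks for tok in chunk]
    atoms = [stream[i:i + atom_size] for i in range(0, len(stream), atom_size)]
    random.Random(seed).shuffle(atoms)
    shuffled = [tok for atom in atoms for tok in atom]
    return [shuffled[i:i + msl] for i in range(0, len(shuffled) - msl + 1, msl)]


# Toy corpus of three tokenized documents (token ids are made up).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
concat_chunks = pack_concat(docs, msl=4)    # 2 chunks; the first mixes two documents
padded_chunks = pack_padding(docs, msl=4)   # 3 chunks; one document each
```

On this toy corpus, concatenation produces two chunks while padding produces three, which gives a small-scale picture of why padding needs more training steps for the same data. Passing an `atom_size` smaller than `msl` to `shuffle_atoms` breaks chunks apart before reassembly, whereas `atom_size == msl` only reorders them.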

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems