Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance

Mario Michael Krell; Matej Kosec; Sergio P. Perez; Andrew Fitzgibbon

arXiv:2107.02027 · cs.CL · October 7, 2022 · 6 citations

TL;DR

This paper introduces a sequence packing method for large language model training that reduces padding, which can account for up to 50% of all tokens in common NLP datasets, without affecting model performance. The reported result is a 2x speedup in BERT pre-training phase 2.

Contribution

The paper formalizes sequence packing as a bin packing problem and develops new algorithms that improve training efficiency while maintaining model accuracy.
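To make the bin packing view concrete, here is a minimal sketch in Python. It uses the generic first-fit-decreasing heuristic, which is a textbook bin packing method and not necessarily one of the algorithms the paper develops. The function names `first_fit_decreasing` and `padding_fraction` are hypothetical and chosen only for this illustration.

```python
from typing import List


def first_fit_decreasing(lengths: List[int], max_len: int) -> List[List[int]]:
    """Pack sequence lengths into bins of capacity max_len.

    Generic first-fit-decreasing heuristic (an assumption for illustration,
    not the paper's algorithm). Returns one list of lengths per pack.
    """
    packs: List[List[int]] = []  # sequence lengths placed in each pack
    free: List[int] = []         # remaining token capacity of each pack
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len={max_len}")
        for i, room in enumerate(free):
            if n <= room:
                # First existing pack with enough room takes the sequence.
                packs[i].append(n)
                free[i] -= n
                break
        else:
            # No existing pack fits, so open a new one.
            packs.append([n])
            free.append(max_len - n)
    return packs


def padding_fraction(packs: List[List[int]], max_len: int) -> float:
    """Fraction of token slots that are padding across all packs."""
    total_slots = len(packs) * max_len
    used_slots = sum(sum(p) for p in packs)
    return 1 - used_slots / total_slots


lengths = [100, 28, 64, 64, 30, 90, 8]
packs = first_fit_decreasing(lengths, max_len=128)
# -> [[100, 28], [90, 30, 8], [64, 64]]
```

With these toy lengths, padding each sequence to 128 on its own row uses 7 rows, while packing uses 3 rows with no padding left over. Real datasets will not pack this cleanly.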

Findings

01. Up to 50% of all tokens in common NLP datasets can be padding.

02. 2x speedup in BERT pre-training phase 2.

03. Packed models are mathematically equivalent to original models.
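One common way to keep packed sequences from attending to each other is a block-diagonal attention mask combined with position indices that restart for every sequence. The sketch below illustrates that idea in plain Python; it is an assumption about how such a mask can be built, not the paper's implementation, and `packed_mask_and_positions` is a hypothetical name.

```python
from typing import List, Tuple


def packed_mask_and_positions(
    seq_lens: List[int], max_len: int
) -> Tuple[List[List[bool]], List[int]]:
    """Build a block-diagonal attention mask and per-sequence positions.

    mask[i][j] is True only when tokens i and j belong to the same
    non-padding sequence within the pack.
    """
    if sum(seq_lens) > max_len:
        raise ValueError("sequences do not fit into one pack")
    segment_ids: List[int] = []   # 1, 2, ... per sequence; 0 marks padding
    position_ids: List[int] = []  # restarts at 0 for every sequence
    for k, n in enumerate(seq_lens, start=1):
        segment_ids += [k] * n
        position_ids += list(range(n))
    pad = max_len - len(segment_ids)
    segment_ids += [0] * pad
    position_ids += [0] * pad
    mask = [[a == b and a != 0 for b in segment_ids] for a in segment_ids]
    return mask, position_ids


mask, positions = packed_mask_and_positions([3, 2], max_len=6)
# positions -> [0, 1, 2, 0, 1, 0]
```

In this example, tokens 0 to 2 can attend only to each other, tokens 3 and 4 only to each other, and the final padding slot attends to nothing.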

Abstract

Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-cola with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid cross-contamination in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context…

Code & Models

Repositories

graphcore/tutorials/tree/sdk-release-2.1/blogs_code/packedBERT
tfOfficial

Datasets

Wisdom-math/wisdom-math
dataset · 67 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Weight Decay · Dropout · Linear Warmup With Linear Decay · Residual Connection