Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

Cheng Luo; Jiawei Zhao; Zhuoming Chen; Beidi Chen; Anima Anandkumar

arXiv:2407.15892·cs.LG·November 12, 2024

TL;DR

Mini-Sequence Transformer (MsT) partitions each input sequence into mini-sequences and processes them iteratively, which reduces intermediate memory usage and allows LLM training on extremely long sequences with no measured loss in throughput or convergence.

Contribution

MsT introduces a simple, general approach to extend sequence length in LLM training, significantly reducing memory requirements with minimal code modifications.

Findings

01

With Llama3-8B, MsT trains on sequences 12x longer than standard implementations with no measured degradation in throughput or convergence.

02

Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.

03

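Combined with activation recomputation, MsT reduces intermediate memory in both the forward and backward passes, and this saving is what makes the longer sequences fit.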
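
As a rough back-of-the-envelope illustration of the memory finding, the sketch below estimates the size of one intermediate tensor when a sequence is processed whole versus in mini-sequences. The function name `intermediate_bytes` and all numbers are assumptions chosen for illustration (an 8192-token sequence, a vocabulary-sized output width of 128,256, and 2-byte elements); none of them are taken from this page.

```python
import math


def intermediate_bytes(seq_len, width, bytes_per_elem, num_chunks=1):
    """Bytes held by one (rows x width) intermediate tensor.

    With num_chunks > 1 only one mini-sequence is materialized at a time,
    so the row count is the mini-sequence length rather than seq_len.
    """
    rows = math.ceil(seq_len / num_chunks)
    return rows * width * bytes_per_elem


# Hypothetical sizes, for illustration only.
SEQ_LEN = 8192      # tokens in the full sequence
WIDTH = 128_256     # assumed vocabulary-sized output dimension
BYTES = 2           # assumed 16-bit elements

full_bytes = intermediate_bytes(SEQ_LEN, WIDTH, BYTES)
mini_bytes = intermediate_bytes(SEQ_LEN, WIDTH, BYTES, num_chunks=16)

print(f"whole sequence:    {full_bytes / 2**20:.0f} MiB")
print(f"16 mini-sequences: {mini_bytes / 2**20:.0f} MiB")
```

Under these assumptions the peak size of this one tensor shrinks in proportion to the number of mini-sequences. Total training memory also includes weights, optimizer state, and saved activations, which this estimate deliberately ignores.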

Abstract

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, we measure no degradation in throughput or convergence with MsT, even with sequences 12x longer than those of standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
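
The partition-and-iterate idea in the abstract can be sketched on a toy position-wise MLP block. This is a minimal pure-Python illustration, not the authors' implementation: the helper names (`matmul`, `gelu`, `mlp_block`, `mini_sequence_mlp`), the toy sizes, and the choice of an MLP block as the example are all assumptions. It relies on the fact that a position-wise block treats every token independently, so splitting along the sequence axis leaves the output unchanged while the largest intermediate tensor shrinks.

```python
import math
import random


def matmul(x, w):
    """Multiply a (rows x k) list-of-lists by a (k x cols) list-of-lists."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def gelu(v):
    """tanh approximation of GELU for one scalar."""
    return 0.5 * v * (1.0 + math.tanh(0.7978845608 * (v + 0.044715 * v ** 3)))


def mlp_block(x, w_up, w_down, stats):
    """Position-wise MLP: up-project, activate, down-project."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w_up)]
    # Record the largest intermediate (rows * intermediate width) seen so far.
    stats["peak"] = max(stats["peak"], len(hidden) * len(hidden[0]))
    return matmul(hidden, w_down)


def mini_sequence_mlp(x, w_up, w_down, num_chunks, stats):
    """Apply the same block one mini-sequence at a time and concatenate."""
    chunk = math.ceil(len(x) / num_chunks)
    out = []
    for start in range(0, len(x), chunk):
        out.extend(mlp_block(x[start:start + chunk], w_up, w_down, stats))
    return out


random.seed(0)
SEQ, HIDDEN, INTER, CHUNKS = 16, 4, 32, 4  # toy sizes, illustration only
x = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ)]
w_up = [[random.uniform(-1, 1) for _ in range(INTER)] for _ in range(HIDDEN)]
w_down = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(INTER)]

full_stats, chunk_stats = {"peak": 0}, {"peak": 0}
full_out = mlp_block(x, w_up, w_down, full_stats)
chunk_out = mini_sequence_mlp(x, w_up, w_down, CHUNKS, chunk_stats)

max_diff = max(
    abs(a - b) for ra, rb in zip(full_out, chunk_out) for a, b in zip(ra, rb)
)
print("max difference:", max_diff)
print("peak intermediate elements:", full_stats["peak"], "vs", chunk_stats["peak"])
```

In this forward-only toy the per-chunk intermediate is discarded as soon as the chunk is done. During training, an autograd framework would normally keep such intermediates for the backward pass, which is presumably why the abstract pairs mini-sequence processing with activation recomputation.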

Code & Models

Repositories

wdlctc/mini-s
jax · Official

Taxonomy

Methods

Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections