Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences
Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He

TL;DR
This paper introduces Arctic Long Sequence Training (ALST), a scalable method enabling training of multi-million token sequences in language models, overcoming memory limitations and making long sequence training accessible outside enterprise environments.
Contribution
ALST provides a novel combination of attention-agnostic memory optimizations for single and multi-GPU setups, supporting multi-million token sequence training for Hugging Face models.
Findings
Supports 500K sequence length on a single GPU
Achieves 3.7M sequence length on 8 GPUs
Over 400x increase in sequence length compared to baseline
Abstract
Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space. Out-of-box, even on a modern NVIDIA H100 80GB GPU cluster, training Llama 8B model with sequence over 32K runs out of memory on a basic Hugging Face (HF) model due to two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU memory, ii) existing solutions for leveraging multiple GPU memory are not easily available to HF models, making long sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAlgorithms and Data Compression
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART
