Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

Stas Bekman; Samyam Rajbhandari; Michael Wyatt; Jeff Rasley; Tunji Ruwase; Zhewei Yao; Aurick Qiao; Yuxiong He

arXiv:2506.13996 · cs.LG · June 18, 2025

TL;DR

This paper introduces Arctic Long Sequence Training (ALST), a scalable method enabling training of multi-million token sequences in language models, overcoming memory limitations and making long sequence training accessible outside enterprise environments.

Contribution

ALST provides a novel combination of attention-agnostic memory optimizations for single and multi-GPU setups, supporting multi-million token sequence training for Hugging Face models.

Findings

01 Supports a sequence length of 500K tokens on a single GPU

02 Achieves a sequence length of 3.7M tokens on 8 GPUs

03 Over 400x increase in sequence length compared to the 32K-token baseline

Abstract

Long sequences are critical for applications such as RAG, long-document summarization, and multi-modality, and modern LLMs such as Llama 4 Scout support maximum sequence lengths of up to 10 million tokens. However, outside of enterprise labs, long-sequence training is challenging for the AI community, with limited system support in the open-source space. Out of the box, even on a modern NVIDIA H100 80GB GPU cluster, training a Llama 8B model with sequences over 32K tokens runs out of memory on a basic Hugging Face (HF) model for two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU's memory, and ii) existing solutions for leveraging the memory of multiple GPUs are not easily available to HF models, making long-sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory…

Taxonomy

Topics: Algorithms and Data Compression

Methods: Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART