Efficient Pre-Training with Token Superposition

Bowen Peng; Théo Gigant; Jeffrey Quesnelle

arXiv:2605.06546·cs.CL·May 20, 2026

TL;DR

Token-Superposition Training (TST) is a simple, efficient method that enhances data throughput during large language model pre-training by combining tokens, leading to significant reductions in training time without architectural changes.

Contribution

The paper introduces TST, a novel drop-in technique that improves pre-training efficiency by superposing tokens, validated across multiple model scales and outperforming baseline methods.

Findings

1. TST achieves up to a 2.5x reduction in pre-training time at the 10B scale.

2. TST is robust across different model sizes and settings.

3. TST outperforms the baseline on both loss and downstream evaluations.

Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) a highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert to standard training. We extensively evaluate TST at the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture-of-experts model, demonstrating that it is highly robust in different settings. Ultimately, TST…
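The superposition phase described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names (make_bags, superpose_embeddings, multi_hot_cross_entropy) are hypothetical, a bag's input is taken to be the mean of its token embeddings, and the MCE target is taken to be a normalized multi-hot vector over the tokens of the bag being predicted. The paper's actual method may differ on each of these points.

```python
import numpy as np

def make_bags(token_ids, bag_size):
    """Group a 1-D token sequence into contiguous bags of `bag_size` tokens."""
    n = (len(token_ids) // bag_size) * bag_size  # drop the ragged tail
    return np.asarray(token_ids[:n]).reshape(-1, bag_size)

def superpose_embeddings(bags, embedding):
    """Assumption: a bag's input vector is the mean of its token embeddings."""
    return embedding[bags].mean(axis=1)  # shape: (num_bags, d_model)

def multi_hot_cross_entropy(logits, target_bags):
    """Assumption: cross-entropy against a normalized multi-hot target.

    logits:      (num_positions, vocab_size) model outputs
    target_bags: (num_positions, bag_size) token ids to be predicted
    """
    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Build the multi-hot target; repeated tokens in a bag are counted.
    targets = np.zeros_like(logits, dtype=float)
    rows = np.arange(len(target_bags))[:, None]
    np.add.at(targets, (rows, target_bags), 1.0)
    targets /= target_bags.shape[1]  # each target row sums to 1

    return float(-(targets * log_probs).sum(axis=-1).mean())
```

With bag_size set to 1, the target collapses to a one-hot vector and the loss reduces to ordinary next-token cross-entropy, which is consistent with the recovery phase being plain standard training. Under these assumptions, a sequence of N tokens becomes N / bag_size positions, which is where the throughput gain would come from.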
