TL;DR
Token-Superposition Training (TST) is a simple, efficient method that enhances data throughput during large language model pre-training by combining tokens, leading to significant reductions in training time without architectural changes.
Contribution
The paper introduces TST, a novel drop-in technique that improves pre-training efficiency by superposing tokens, validated across multiple model scales and outperforming baseline methods.
Findings
TST achieves up to 2.5x reduction in pre-training time at 10B scale.
TST is robust across different model sizes and settings.
It outperforms the baseline in both training loss and downstream evaluations.
Abstract
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) a highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert to standard training. We extensively evaluate TST at the scale of 270M and 600M parameters and validate on a 3B model and a 10B-A1B mixture-of-experts model, demonstrating that it is highly robust in different settings. Ultimately, TST…
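The superposition phase described in the abstract can be illustrated with a minimal sketch. The abstract gives no exact formulation, so everything below is an assumption made for illustration: bags are contiguous and non-overlapping, leftover tokens are dropped, and the multi-hot target is normalized to sum to 1. The helper names (`make_bags`, `multi_hot_target`, `multi_hot_cross_entropy`) are hypothetical and are not taken from the paper or its released code.

```python
import math


def make_bags(token_ids, bag_size):
    """Split a token sequence into contiguous, non-overlapping bags.

    Assumption: trailing tokens that do not fill a whole bag are dropped.
    """
    n_bags = len(token_ids) // bag_size
    return [token_ids[i * bag_size:(i + 1) * bag_size] for i in range(n_bags)]


def multi_hot_target(bag, vocab_size):
    """Build a normalized multi-hot target over the vocabulary.

    Assumption: each token in the bag contributes 1 / len(bag), so repeated
    tokens get proportionally more mass and the target sums to 1.
    """
    target = [0.0] * vocab_size
    for tok in bag:
        target[tok] += 1.0 / len(bag)
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy between softmax(logits) and the normalized multi-hot target."""
    log_probs = log_softmax(logits)
    target = multi_hot_target(bag, len(logits))
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Toy example: 8 tokens, vocabulary of 6, bag size 4 (all values illustrative).
tokens = [3, 1, 4, 1, 5, 2, 0, 3]
bags = make_bags(tokens, bag_size=4)

# One prediction position per bag; uniform logits stand in for a model output.
uniform_logits = [0.0] * 6
loss = multi_hot_cross_entropy(uniform_logits, bags[1])
```

Under these assumptions, a bag size of 4 turns 8 tokens into 2 positions, which is where the higher data throughput per step would come from. In the recovery phase, the same loss with a bag size of 1 reduces to ordinary next-token cross-entropy.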
Code & Models
- 🤗 GestaltLabs/BOREAL-10B-MoE · model · ♡ 9
- 🤗 GestaltLabs/BOREAL-250M · model · ♡ 2
- 🤗 hudsongouge/AOMTS-Base-100M-3k-0MTP-v1-run2 · model · 115 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-Base-100M-3k-1MTP-v1 · model · 46 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-Base-100M-3k-2MTP-v1 · model · 120 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-TST-s6-100M-3k-0MTP-v1 · model · 116 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-TST-s6-100M-3k-1MTP-v1 · model · 114 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-TST-s6-100M-3k-2MTP-v1 · model · 114 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-TST-s6-100M-3k-0MTP-RESET-v1 · model · 128 dl · ♡ 1
- 🤗 hudsongouge/AOMTS-Base-100M-3k-1MTP-Cosine-v1 · model · 117 dl · ♡ 1
