TL;DR
This paper introduces token order prediction (TOP), an auxiliary training objective for language models that trains the model to rank upcoming tokens by their proximity. TOP overall outperforms NTP, MTP, and DS-MTP on nine standard NLP benchmarks.
Contribution
The paper proposes token order prediction (TOP), an auxiliary task in which the model learns to order upcoming tokens by their proximity using a learning-to-rank loss, at the cost of a single additional unembedding layer.
Findings
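TOP overall outperforms NTP, MTP, and DS-MTP across nine standard NLP benchmarks at the 340M, 1.8B, and 7B scales.
With continued training on math and code, TOP models also perform better on 4 relevant benchmarks.
On the synthetic star graph task, TOP enables pathfinding where other methods fail.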
Abstract
Multi-token prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We found MTP's exact future token prediction to be too difficult as an auxiliary loss. Instead, we propose token order prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, DeepSeek MTP (DS-MTP) and TOP objectives. The results of nine standard NLP benchmarks show that TOP overall outperforms NTP, MTP, and DS-MTP even at scale. TOP models with continued training on math and code also perform better on 4 relevant benchmarks. On the synthetic star…
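The ranking objective described in the abstract can be illustrated with a small sketch. The abstract only states that TOP orders upcoming tokens by proximity with a learning-to-rank loss. The specific choices below are assumptions made for illustration and are not taken from the paper: the scoring rule (`window - d + 1` for a token first seen at distance `d`), the `-inf` score for vocabulary ids that do not occur in the window, and the ListNet-style listwise loss. The function names `top_targets` and `listwise_rank_loss` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def top_targets(tokens, vocab_size, window):
    """Build one proximity-score vector per position (illustrative scheme).

    A token whose nearest upcoming occurrence is at distance d
    (1 <= d <= window) gets score window - d + 1, so nearer means higher.
    Ids absent from the window get -inf (an assumption of this sketch).
    """
    targets = []
    for t in range(len(tokens)):
        scores = [NEG_INF] * vocab_size
        # Walk from the farthest offset inward so nearer occurrences overwrite.
        for d in range(window, 0, -1):
            if t + d < len(tokens):
                scores[tokens[t + d]] = float(window - d + 1)
        targets.append(scores)
    return targets


def log_softmax(xs):
    """Numerically stable log-softmax over a list of finite floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def listwise_rank_loss(logits, scores):
    """ListNet-style cross-entropy between softmax(scores) and softmax(logits).

    Positions with no upcoming tokens (all scores -inf) contribute 0.
    """
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        return 0.0
    m = max(finite)
    weights = [0.0 if s == NEG_INF else math.exp(s - m) for s in scores]
    z = sum(weights)
    logp = log_softmax(logits)
    return -sum(w / z * lp for w, lp in zip(weights, logp) if w > 0.0)
```

In a real model the `logits` would come from the extra unembedding layer applied to the final hidden state, and this loss would be added to the usual NTP loss. Calling `top_targets([0, 1, 2, 1], vocab_size=4, window=2)` returns one score vector per position, and logits that rank the nearer token higher yield a lower `listwise_rank_loss`.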
Peer Reviews
Decision · Submitted to ICLR 2026
1. **Novel approach**: The paper proposes TOP as a theoretically sound middle ground between NTP (too myopic) and MTP (too difficult). The connection to learning-to-rank is elegant.
2. **Strong empirical results on specific tasks**: TOP consistently outperforms both baselines across most tasks, with particularly impressive results on the star graph task (100% accuracy where others fail) demonstrating genuine improvement in look-ahead reasoning capabilities.
1. **Missing critical baselines and ablations, while over-emphasizing obvious observations**:
   - Motivation section (page 3): Approximately two-thirds of a page is devoted to explaining Figure 2, which shows that MTP loss increases with prediction distance. This is an intuitive and expected result that does not warrant such extensive discussion.
   - MTP baseline incomplete: The paper only compares against Meta's MTP variant (multiple linear heads for MTP) but ignores DeepSeek's MTP architectu
1) Interesting reformulation of auxiliary objectives: The idea of relaxing MTP’s difficult prediction target into a ranking-based proximity task is conceptually appealing.
2) Simple and lightweight implementation: TOP only requires an additional unembedding layer and is compatible with existing transformer architectures.
3) Broad empirical evaluation: The experiments cover multiple model scales and both standard and synthetic benchmarks.
W1) **Figure 2 lacks clarity** The figure is supposed to show that predicting farther tokens is harder, but there’s no legend or label for each position (t+1, t+2, …). It’s unclear which curve corresponds to which distance, so the trend the authors claim isn’t visually evident. A similar plot for the TOP objective would also help illustrate whether it really leads to smoother or easier training.
W2) **The claim that TOP is “easier” isn’t well supported** The paper keeps describing TOP as an
This paper is well structured and highly motivated. Replacing token identity prediction with ordinal proximity is a clever relaxation that retains lookahead signal while reducing optimization hardness. TOP requires only one extra linear layer, in contrast to MTP's per-token transformer heads. The evaluation is comprehensive, spanning multiple model sizes and diverse NLP benchmarks.
The observation that TOP achieves higher NTP training loss yet better downstream performance is interesting but underexplored. The authors hypothesize regularization but provide no ablation (e.g., varying TOP loss weight, early stopping comparisons) to confirm this. The TOP target assigns scores to all vocabulary tokens, most of which do not appear in the window. This may create a highly sparse and noisy supervision signal.
Code & Models
- 🤗 zaydzuhri/vanilla-340M-4096-model
- 🤗 zaydzuhri/vanilla-1.8B-4096-model
- 🤗 zaydzuhri/vanilla-7B-4096-model · 3 dl
- 🤗 zaydzuhri/top-7B-4096-model
- 🤗 zaydzuhri/mtp-1.8B-4096-model · 1 dl
- 🤗 zaydzuhri/mtp-340M-4096-model
- 🤗 zaydzuhri/mtp-7B-4096-model
- 🤗 zaydzuhri/top-340M-4096-model · 1 dl · ♡ 1
- 🤗 zaydzuhri/top-1.8B-4096-model
- 🤗 MostLime/lcm-chess · ♡ 1
