Multi-Token Prediction Needs Registers
Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis

TL;DR
MuToR is a simple, parameter-efficient method for multi-token prediction that interleaves learnable register tokens into the input sequence, improving performance across language and vision tasks without architectural changes.
Contribution
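The paper introduces MuToR, an approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting a future target. It requires no architectural changes, so it is compatible with off-the-shelf pretrained language models, and it is applied to both language and vision tasks.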
Findings
Effective in supervised fine-tuning and parameter-efficient fine-tuning (PEFT)
Supports scalable prediction horizons
Demonstrates versatility in language and vision tasks
Abstract
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use…
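The interleaving idea in the abstract can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the `<reg>` placeholder, the function name `interleave_registers`, the choice to place one register after each regular token, and the rule that a register placed after position t with offset d targets the token at position t + d + 1 are all assumptions made for illustration. Attention masking, position ids, and the loss weighting are omitted.

```python
REG = "<reg>"  # hypothetical placeholder for a learnable register token


def interleave_registers(tokens, d):
    """Build (inputs, targets) with register tokens interleaved.

    Assumptions for this sketch (not taken from the source):
      * a regular token at position t keeps its next-token target tokens[t + 1];
      * one register is inserted after tokens[t] and targets tokens[t + d + 1];
      * a register is only inserted when that future target exists.
    A target of None marks a position that contributes no loss.
    """
    n = len(tokens)
    inputs, targets = [], []
    for t, tok in enumerate(tokens):
        # Regular token: ordinary next-token prediction target.
        inputs.append(tok)
        targets.append(tokens[t + 1] if t + 1 < n else None)
        # Register token: predicts a target d steps beyond the next token.
        future = t + d + 1
        if future < n:
            inputs.append(REG)
            targets.append(tokens[future])
    return inputs, targets


inputs, targets = interleave_registers(["a", "b", "c", "d", "e"], d=2)
print(inputs)   # ['a', '<reg>', 'b', '<reg>', 'c', 'd', 'e']
print(targets)  # ['b', 'd', 'c', 'e', 'd', 'e', None]
```

Dropping the `<reg>` entries from `inputs` recovers the original sequence, which mirrors the point that the regular tokens remain aligned with the next-token objective and the registers only add extra prediction targets.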
Code & Models
- 🤗 nasos10/MuToR-gemma-2B-GSM8K-dmax_4_a_03 (model, 2 downloads)
- 🤗 nasos10/MuToR-gemma-2B-1M_GSM-dmax_4_a_03 (model)
- 🤗 nasos10/MuToR-gemma-2B-1M_MATH-dmax_3_a_02 (model, 1 download)
- 🤗 nasos10/MuToR-llama3-8B-GSM8K-dmax_4_a_03 (model, 1 download)
- 🤗 nasos10/MuToR-llama3-8B-1M_GSM-dmax_5_a_01 (model, 2 downloads)
- 🤗 nasos10/MuToR-llama3-8B-1M_MATH-dmax_4_a_01 (model, 5 downloads)
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
