L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Xiaohao Liu; Xiaobo Xia; Weixiang Zhao; Manyi Zhang; Xianzhi Yu; Xiu Su; Shuo Yang; See-Kiong Ng; Tat-Seng Chua

arXiv:2505.17505·cs.CL·September 23, 2025

TL;DR

L-MTP introduces a leap-based multi-token prediction method for large language models that improves long-range dependency capture and accelerates inference by predicting non-adjacent tokens in a single pass.

Contribution

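The paper proposes leap multi-token prediction (L-MTP), which extends conventional multi-token prediction (MTP) by predicting tokens at non-adjacent positions in a single forward pass, with the aim of improving inference efficiency and long-range dependency modeling in LLMs.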

Findings

01. Boosts inference speed significantly.
02. Improves performance on diverse benchmarks.
03. Effectively captures long-range dependencies.

Abstract

Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating…
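The contrast between adjacent and leap prediction targets described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which positions are skipped, so the fixed `stride` parameter, the function names, and the toy sentence below are all assumptions made for illustration. The sketch only lists which future positions each prediction head would target from a given position `t`.

```python
from typing import List, Sequence


def adjacent_offsets(num_heads: int) -> List[int]:
    """Offsets targeted by conventional MTP: the next num_heads adjacent positions."""
    return list(range(1, num_heads + 1))


def leap_offsets(num_heads: int, stride: int) -> List[int]:
    """Offsets targeted with a leap: start at +1, then skip ahead by `stride`.

    `stride` is a hypothetical parameter for this sketch (not taken from the
    abstract); stride=1 reduces to adjacent MTP.
    """
    if num_heads < 1 or stride < 1:
        raise ValueError("num_heads and stride must be positive")
    return [1 + stride * i for i in range(num_heads)]


def targets_at(tokens: Sequence[str], t: int, offsets: List[int]) -> List[str]:
    """Tokens the heads would predict from position t; offsets past the end are dropped."""
    return [tokens[t + o] for o in offsets if t + o < len(tokens)]


tokens = ["The", "cat", "sat", "on", "the", "mat", "today", "."]
print(targets_at(tokens, 0, adjacent_offsets(3)))  # ['cat', 'sat', 'on']
print(targets_at(tokens, 0, leap_offsets(3, 2)))   # ['cat', 'on', 'mat']
```

Under these assumptions, the same number of heads reaches further ahead: three adjacent heads cover offsets 1 to 3, while three leap heads with a stride of 2 reach offset 5, leaving the skipped positions to be filled in by other forward passes.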

Code & Models

Models

xiao-hao/L-MTP
model (Hugging Face)

Datasets

xiao-hao/self-distillation-LLMs
dataset · 80 dl
