NDP: Next Distribution Prediction as a More Broad Target
Junhao Ruan, Abudukeyumu Abudula, Xinyu Liu, Bei Li, Yinqiao Li, Chenglong Wang, Yuchun Fan, Yuan Ge, Tong Xiao, Jingbo Zhu

TL;DR
This paper proposes Next Distribution Prediction (NDP), replacing one-hot targets with n-gram distributions in language models, leading to significant improvements across translation, general tasks, and medical domain adaptation.
Contribution
The paper introduces NDP, a novel training target using n-gram distributions instead of one-hot vectors, addressing limitations of the traditional next-token prediction paradigm.
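To make the contrast between the two kinds of target concrete, here is a minimal sketch in Python. It is not the authors' implementation: this summary does not say how the paper builds its n-gram distributions, which n it uses, or how they enter the loss. The toy corpus, the bigram setting (n=2), the function names `ngram_target_distributions` and `cross_entropy`, and the `model_probs` values are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def ngram_target_distributions(tokens, n=2):
    """Map each (n-1)-token context to the empirical next-token distribution.

    Hypothetical helper: counts which token follows each context in `tokens`
    and normalizes the counts into probabilities.
    """
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts[context][tokens[i]] += 1
    return {
        ctx: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for ctx, ctr in counts.items()
    }


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(
        p * math.log(predicted.get(tok, 0.0) + eps) for tok, p in target.items()
    )


# Toy corpus (assumption): "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat the cat ran".split()
dists = ngram_target_distributions(corpus, n=2)

# Distribution target for the context ("the",): mass spread over observed successors.
soft_target = dists[("the",)]

# One-hot target for a single occurrence of that context: all mass on one token.
one_hot_target = {"cat": 1.0}

# Hypothetical model output for the same context.
model_probs = {"cat": 0.6, "mat": 0.3, "ran": 0.1}

loss_soft = cross_entropy(soft_target, model_probs)
loss_hard = cross_entropy(one_hot_target, model_probs)
print(soft_target, loss_soft, loss_hard)
```

In this sketch the one-hot target rewards the model only for the single observed token, while the distribution target also gives credit for probability placed on "mat", which the corpus shows is a valid continuation of the same context.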
Findings
NDP achieves up to +2.97 COMET in translation tasks.
NDP improves average scores by 0.61 in general tasks.
NDP yields +10.75 average improvement in medical domain adaptation.
Abstract
Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the n-gram distribution and the one-hot distribution with LLMs, we observed that the n-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses n-gram…
Taxonomy
Topics: Algorithms and Data Compression
Methods: ALIGN
