NDP: Next Distribution Prediction as a More Broad Target

Junhao Ruan; Abudukeyumu Abudula; Xinyu Liu; Bei Li; Yinqiao Li; Chenglong Wang; Yuchun Fan; Yuan Ge; Tong Xiao; Jingbo Zhu

arXiv:2408.17377 · cs.CL · September 2, 2024

Open Access

TL;DR

This paper proposes Next Distribution Prediction (NDP), replacing one-hot targets with n-gram distributions in language models, leading to significant improvements across translation, general tasks, and medical domain adaptation.

Contribution

The paper introduces NDP, a novel training target using n-gram distributions instead of one-hot vectors, addressing limitations of the traditional next-token prediction paradigm.

Findings

01. NDP achieves up to +2.97 COMET in translation tasks.

02. NDP improves average scores by 0.61 in general tasks.

03. NDP yields +10.75 average improvement in medical domain adaptation.

Abstract

Large language models (LLMs) trained under the next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm has several limitations, particularly difficulties with tasks that require planning and error propagation during inference. In this work, we extend the critique of NTP, arguing that it is also limited by a narrow training objective: predicting a sub-optimal one-hot distribution. To support this critique, we ran a preliminary experiment that treats the output distributions of powerful LLMs as an efficient compression of world data. By measuring how similar the $n$-gram distribution and the one-hot distribution each are to the LLM output distributions, we observed that the $n$-gram distributions align more closely with the output distributions of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram…
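To make the contrast between a one-hot target and an $n$-gram distribution target concrete, here is a minimal sketch. It is not the authors' implementation: the abstract above is truncated and does not say how NDP builds or combines its distributions, so the toy corpus, the plain count normalization, and the helper names `ngram_target_distributions` and `cross_entropy` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def ngram_target_distributions(corpus, n=2):
    """Hypothetical helper: map each (n-1)-token context to a normalized
    distribution over the tokens that follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            context = tuple(sentence[i:i + n - 1])
            next_token = sentence[i + n - 1]
            counts[context][next_token] += 1
    distributions = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        distributions[context] = {tok: c / total for tok, c in counter.items()}
    return distributions


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(p * math.log(predicted[tok]) for tok, p in target.items() if p > 0)


# Toy corpus (assumption): three tokenized sentences.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

dists = ngram_target_distributions(corpus, n=2)

# Distribution target for the context ("the",): built from bigram counts.
soft_target = dists[("the",)]

# One-hot target for a single training example "the cat".
one_hot_target = {"cat": 1.0}

# Made-up model output probabilities over a four-token vocabulary.
model_probs = {"cat": 0.6, "dog": 0.3, "sat": 0.05, "ran": 0.05}

loss_soft = cross_entropy(soft_target, model_probs)
loss_one_hot = cross_entropy(one_hot_target, model_probs)
```

In this toy setup the one-hot target only rewards probability on "cat", while the bigram-derived target also rewards probability on "dog", which is the sense in which a distribution is a broader target than a single token.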

Taxonomy

Topics: Algorithms and Data Compression

Methods: ALIGN