Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Chun-Hao Yang; Bo-Han Feng; Tzu-Yuan Lai; Yan Yu Chen; Yin-Kai Dean Huang; Shou-De Lin

arXiv:2511.00198·cs.CL·November 4, 2025


Open Access

TL;DR

This paper proposes a new training approach for large language models that focuses on predicting information-rich tokens, aiming to improve performance and efficiency over traditional next-token prediction methods.

Contribution

It introduces a principled method for selecting target tokens during training, enhancing model performance and theoretical understanding beyond conventional next-token prediction.

Findings

01 Improved performance on arithmetic tasks

02 Enhanced multi-label classification accuracy

03 Better natural-language generation quality

Abstract

Improving the training of large language models (LLMs) remains an essential challenge, particularly raising model performance without increasing computational cost. This work challenges the conventional practice of training LLMs with next-token prediction (NTP), arguing that predicting information-rich tokens during training is a more effective way to train them. We investigate the impact of the proposed approach on three kinds of LLM tasks: arithmetic, multi-label text classification, and natural-language generation. The work offers a principled approach to optimizing LLM training, advancing both model performance and the theoretical understanding of target-token selection strategies.
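The abstract does not spell out how "information-rich" tokens are identified. As a rough illustration of the underlying idea only, the sketch below estimates the empirical mutual information between a source string and the token at each target position on toy arithmetic data, then ranks positions by that score. The function names, the toy data, and the ranking rule are all assumptions made for this example and should not be read as the authors' algorithm.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information, in bits, between x and y over (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def rank_target_positions(sources, targets):
    """Rank target positions by mutual information with the source.

    Hypothetical selection rule for illustration; assumes every target
    string has the same length.
    """
    length = len(targets[0])
    scores = {
        pos: mutual_information([(s, t[pos]) for s, t in zip(sources, targets)])
        for pos in range(length)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (assumed): each target is "=" followed by a two-digit sum.
sources = ["3+4", "5+5", "2+9", "1+1"]
targets = ["=07", "=10", "=11", "=02"]

for position, bits in rank_target_positions(sources, targets):
    print(position, round(bits, 3))
```

On this toy data the constant "=" at position 0 scores zero bits, because it is the same for every source, while the digit positions score higher. Because every source string here is unique, the estimate reduces to the entropy of the target token; a realistic estimate would need far more data and a coarser representation of the source.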


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification