Reasoning Bias of Next Token Prediction Training
Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu

TL;DR
This paper compares the reasoning capabilities of Large Language Models trained with next token prediction (NTP) against those trained with critical token prediction (CTP). It finds that NTP surprisingly outperforms CTP, an outcome the authors attribute to the regularizing effect of noise during training.
Contribution
The study demonstrates that next token prediction training leads to better reasoning abilities in LLMs than critical token prediction, contrary to the expectation that training only on critical tokens would reduce overfitting to extraneous information and noise.
Findings
NTP-trained models show better reasoning performance than CTP-trained models.
Noise during NTP training acts as a regularizer.
NTP models exhibit greater robustness and flatter minima.
Abstract
Since the inception of Large Language Models (LLMs), the quest to efficiently train them for superior reasoning capabilities has been a pivotal challenge. The dominant training paradigm for LLMs is based on next token prediction (NTP). Alternative methodologies, called Critical Token Prediction (CTP), focus exclusively on specific critical tokens (such as the answer in a Q&A dataset), aiming to reduce overfitting to extraneous information and noise. Contrary to initial assumptions, our research reveals that despite NTP's exposure to noise during training, it surpasses CTP in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics. Our empirical analysis shows that NTP-trained models exhibit enhanced generalization and robustness across various benchmark reasoning datasets, demonstrating greater resilience to…
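The difference between the two objectives described in the abstract can be illustrated with a loss mask. The sketch below is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the choice of which position counts as "critical" are all hypothetical. It assumes both objectives use the same per-token cross-entropy and differ only in which positions contribute to the average.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy of one position: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def masked_mean_loss(all_logits, targets, mask):
    """Average the per-token loss over the positions where mask is 1."""
    losses = [token_cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    total = sum(w * loss for w, loss in zip(mask, losses))
    return total / sum(mask)

# Toy sequence: 4 positions, vocabulary of 3 tokens (hypothetical values).
all_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
]
targets = [0, 1, 2, 1]

# NTP: every position contributes to the loss.
ntp_mask = [1, 1, 1, 1]
# CTP: only the critical position contributes (assumed here to be the
# final token, standing in for the answer of a Q&A example).
ctp_mask = [0, 0, 0, 1]

ntp_loss = masked_mean_loss(all_logits, targets, ntp_mask)
ctp_loss = masked_mean_loss(all_logits, targets, ctp_mask)
print(f"NTP loss: {ntp_loss:.4f}")
print(f"CTP loss: {ctp_loss:.4f}")
```

Under this framing, NTP is the special case where the mask is all ones, so the non-critical tokens (the "noise" in the abstract's terms) also shape the gradient, while CTP discards them.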
Taxonomy
Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling
