Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Chenze Shao; Fandong Meng; Jie Zhou

arXiv:2407.12665 · cs.CL · May 16, 2025

TL;DR

This paper introduces patch-level training for large language models, aggregating multiple tokens into higher-density units to significantly reduce training costs while maintaining performance.

Contribution

It proposes a novel patch-level training method that decreases training costs by processing shorter sequences of aggregated tokens, followed by token-level fine-tuning.

Findings

01 · Training costs reduced to 50% of traditional methods

02 · Model performance remains comparable to token-level training

03 · Applicable to models from 370M to 2.7B parameters
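
The 50% figure can be related to the two hyperparameters the reviewers mention, the patch size K and the fraction lambda of training data used for patch-level training. The sketch below assumes a simple cost model in which processing a patch sequence costs 1/K of processing the corresponding token sequence; the function name and the example values (K = 4, lambda = 2/3) are illustrative assumptions, not settings confirmed by this page.

```python
def relative_training_cost(patch_size: int, patch_fraction: float) -> float:
    """Training cost relative to pure token-level training.

    Assumed cost model (not taken from this page): a fraction
    `patch_fraction` of the data is processed as patches at 1/patch_size
    of the token-level cost, and the remainder at full token-level cost.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    if not 0.0 <= patch_fraction <= 1.0:
        raise ValueError("patch_fraction must lie in [0, 1]")
    return patch_fraction / patch_size + (1.0 - patch_fraction)


# Illustrative values only: K = 4 and lambda = 2/3 give about half the cost.
cost = relative_training_cost(patch_size=4, patch_fraction=2 / 3)
print(f"relative cost: {cost:.2f}")
```

Under this model, a larger K only shrinks the patch-level term, so the achievable saving is bounded by the share of data that still goes through token-level training.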

Abstract

The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a 'patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B…
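
As a rough illustration of the data flow described in the abstract, the sketch below groups a token sequence into patches and pairs each patch with the next one as its prediction target. The aggregation rule (averaging the token embeddings in a patch), the dropping of leftover tokens, and all function names are assumptions made for illustration; the abstract as shown here does not specify them.

```python
from typing import List, Sequence, Tuple


def group_into_patches(token_ids: Sequence[int], patch_size: int) -> List[List[int]]:
    """Split tokens into consecutive patches of `patch_size` tokens.

    Assumption: trailing tokens that do not fill a whole patch are dropped.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    usable = len(token_ids) - len(token_ids) % patch_size
    return [list(token_ids[i:i + patch_size]) for i in range(0, usable, patch_size)]


def average_patch_embedding(
    patch: Sequence[int], embedding_table: Sequence[Sequence[float]]
) -> List[float]:
    """Aggregate a patch into one vector by averaging its token embeddings
    (an assumed aggregation rule, chosen for simplicity)."""
    dim = len(embedding_table[0])
    return [sum(embedding_table[t][d] for t in patch) / len(patch) for d in range(dim)]


def next_patch_pairs(patches: List[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Pair each patch (model input) with the following patch (target)."""
    return list(zip(patches[:-1], patches[1:]))


# Toy example: 9 tokens, patch size 4, two-dimensional embeddings.
tokens = list(range(9))
table = [[float(t), 1.0] for t in range(9)]
patches = group_into_patches(tokens, patch_size=4)
inputs = [average_patch_embedding(p, table) for p in patches]
pairs = next_patch_pairs(patches)
```

With a patch size of 4, the model sees a sequence a quarter as long as the token sequence, which is where the reduced cost of the patch-level phase comes from.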

Peer Reviews

Decision · ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Reducing LLM training costs is an important research question. The problem formulation, motivation, and improvement objective of the paper are clear.
2. The proposed method is described clearly, and there seem to be adequate details to reproduce the work. The cost reduction of the proposed training method is significant and robust across multiple LLM sizes and NLP tasks.
3. The experiments cover a wide range of settings such as the fraction of patch-level training (lambda), patch size (K),

Weaknesses

1. The optimal hyperparameters and intuitions derived from the work may be very specific to the settings associated with the paper. For example, the optimal values of lambda and patch size may be sensitive to the tokenizer. Therefore, the improvements of the proposed work may not generalize similarly to a new tokenizer.
2. It is not obvious whether the patch-level pretraining is suitable for larger scales. There are issues with training stability and convergence that only occur when the model s

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The method is easy to implement and makes an original contribution. I see it being used a lot in the community and industry if it passes a proof of concept in a real-world scenario.
2. High-quality experiments cover a wide range of model sizes and data settings. In addition, a wide range of benchmarks is covered to evaluate the final model performance compared with the usual token-level training.
3. Assuming that this approach would scale with even larger models, this work provides a significant co

Weaknesses

* Unclear conclusions about the effect of architecture. In the architecture ablation, the observations suggest that there might be some use cases where subsequent token-level training does not work very well. IMO this might be important to highlight better in the text to avoid overselling the method.
* No numbers showing the actual speed improvements. Since all token projections are computed in patch-level training, the expensive softmax operations over vocabularies have to be done. So

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper presents an interesting approach to accelerate language model training, specifically:
* the approach appears to be novel
* the paper is well written and clear in the presentation
* the approach is thoroughly evaluated (different (small) model sizes, including an array of ablations, an alternative approach to computing patches etc.)

Weaknesses

* The evaluation is on the smaller scale of language model sizes. Evaluation for larger model sizes is very challenging but it would be a large plus to understand the scalability of the method.
* Table 3/4 show PPL results and draw conclusions based on the difference between PPLs in the range of ~10 and later on in ~7. This assumes that PPL is a linear measure but it is in fact an exponential quantity based on cross entropy. Therefore, the actual entropy difference between PPL 11 and PPL 10 is fa
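
The relation behind this point is that perplexity is the exponential of the per-token cross entropy, so equal perplexity gaps correspond to different entropy gaps at different perplexity levels. A minimal sketch, using illustrative perplexity values that are not taken from the paper's tables and a hypothetical helper name:

```python
import math


def cross_entropy_gap(ppl_a: float, ppl_b: float) -> float:
    """Difference in per-token cross entropy (in nats) between two
    perplexities, using PPL = exp(cross entropy)."""
    if ppl_a <= 0 or ppl_b <= 0:
        raise ValueError("perplexities must be positive")
    return math.log(ppl_a) - math.log(ppl_b)


# Illustrative values only: both pairs differ by one perplexity point.
gap_high = cross_entropy_gap(11.0, 10.0)  # ln(11/10)
gap_low = cross_entropy_gap(8.0, 7.0)     # ln(8/7)
print(f"PPL 11 vs 10: {gap_high:.4f} nats")
print(f"PPL 8 vs 7:   {gap_low:.4f} nats")
```

A one-point perplexity difference near 10 corresponds to roughly 0.095 nats, while the same difference near 7 corresponds to roughly 0.134 nats, so comparisons made at different perplexity levels are easier to interpret in cross-entropy terms.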

Code & Models

Repositories

shaochenze/patchtrain
PyTorch · Official

Models

🤗
kajuma/DiffLlama-1B
model · 156 dl · ♡ 3

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN