Beyond Next Token Prediction: Patch-Level Training for Large Language Models
Chenze Shao, Fandong Meng, Jie Zhou

TL;DR
This paper introduces patch-level training for large language models, aggregating multiple tokens into higher-density units to significantly reduce training costs while maintaining performance.
Contribution
It proposes a novel patch-level training method that decreases training costs by processing shorter sequences of aggregated tokens, followed by token-level fine-tuning.
Findings
Training costs reduced to 50% of those of conventional token-level training
Model performance remains comparable to token-level training
Applicable to models from 370M to 2.7B parameters
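A simple cost accounting shows how a 50% figure can arise. The sketch below assumes that a fraction λ of the training data is processed at the patch level with patch size K, that patch sequences are K times shorter than token sequences, and that cost scales with the number of positions processed. Under those assumptions the relative cost is λ/K + (1 − λ). The function name and the example values K = 4, λ = 2/3 are illustrative assumptions, not settings stated in this summary.

```python
def relative_training_cost(patch_fraction: float, patch_size: int) -> float:
    """Estimate training cost relative to pure token-level training.

    patch_fraction: share of the training data seen in patch-level mode (lambda).
    patch_size:     number of tokens aggregated into one patch (K).

    Assumption: cost is proportional to the number of sequence positions
    processed, so patch-level data costs 1/K of token-level data.
    """
    if not 0.0 <= patch_fraction <= 1.0:
        raise ValueError("patch_fraction must lie in [0, 1]")
    if patch_size < 1:
        raise ValueError("patch_size must be a positive integer")
    patch_cost = patch_fraction / patch_size   # cheap patch-level phase
    token_cost = 1.0 - patch_fraction          # remaining token-level phase
    return patch_cost + token_cost


# Illustrative setting (an assumption): K = 4, lambda = 2/3.
example_cost = relative_training_cost(2 / 3, 4)
print(f"relative cost: {example_cost:.3f}")
```

With these illustrative values the patch-level phase contributes 1/6 and the token-level phase 1/3 of the baseline cost, which sums to one half.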
Abstract
The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a 'patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B…
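The data side of patch-level training can be sketched in a few lines. The abstract does not say how tokens are aggregated, so the example below assumes a patch embedding is the mean of its token embeddings. All function names and the toy embedding table are hypothetical, and the model and loss are omitted.

```python
from typing import List, Sequence, Tuple

Patch = List[int]


def to_patches(token_ids: Sequence[int], patch_size: int) -> List[Patch]:
    """Group consecutive token ids into patches of `patch_size` tokens.

    Trailing tokens that do not fill a whole patch are dropped.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be a positive integer")
    usable = len(token_ids) - len(token_ids) % patch_size
    return [list(token_ids[i:i + patch_size]) for i in range(0, usable, patch_size)]


def patch_embedding(patch: Patch, embedding_table: Sequence[Sequence[float]]) -> List[float]:
    """Average the token embeddings in a patch (an assumed aggregation rule)."""
    dim = len(embedding_table[0])
    return [sum(embedding_table[t][d] for t in patch) / len(patch) for d in range(dim)]


def next_patch_pairs(patches: List[Patch]) -> List[Tuple[Patch, Patch]]:
    """Pair each patch with the patch that follows it (input, prediction target)."""
    return list(zip(patches[:-1], patches[1:]))


# Toy example: 9 tokens, patch size 4, so the sequence shrinks from 9 to 2 positions.
tokens = [1, 2, 3, 4, 5, 6, 7, 8, 9]
patches = to_patches(tokens, patch_size=4)
pairs = next_patch_pairs(patches)
```

The saving comes from sequence length: the model sees one position per patch instead of one per token, so a patch size of 4 cuts the number of positions by roughly a factor of four during the patch-level phase.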
Peer Reviews
Decision: ICLR 2025 Spotlight
1. Reducing LLM training costs is an important research question. The problem formulation, motivation, and improvement objective of the paper are clear. 2. The proposed method is described clearly, and there seem to be adequate details to reproduce the work. The cost reduction of the proposed training method is significant and robust across multiple LLM sizes and NLP tasks. 3. The experiments cover a wide range of settings such as the fraction of patch-level training (lambda), patch size (K),
1. The optimal hyperparameters and intuitions derived from the work may be very specific to the settings associated with the paper. For example, the optimal values of lambda and patch size may be sensitive to the tokenizer. Therefore, the improvements of the proposed work may not generalize similarly to a new tokenizer. 2. It is not obvious whether the patch-level pretraining is suitable for larger scales. There are issues with training stability and convergence that only occur when the model s
1. The method is easy to implement and makes an original contribution; I see it being used a lot in the community and industry if it passes proof of concept in real-world scenarios. 2. High-quality experiments cover a wide range of model sizes and data settings. In addition, a wide range of benchmarks is covered to evaluate the final model performance compared with the usual token-level training. 3. Assuming that this approach would scale with even larger models, this work provides a significant co
* Unclear conclusions about the effect of architecture. In the architecture ablation, the observations suggest that there might be some use cases where subsequent token-level training does not work very well. IMO this might be important to highlight better in the text to avoid overselling the method. * No numbers showing the actual speed improvements. Since all token projections are computed in patch-level training, the expensive softmax operations over vocabularies have to be done. So
The paper presents an interesting approach to accelerate language model training, specifically: * the approach appears to be novel * the paper is well written and clear in the presentation * the approach is thoroughly evaluated (different (small) model sizes, including an array of ablations, an alternative approach to computing patches etc.)
* The evaluation is on the smaller scale of language model sizes. Evaluation for larger model sizes is very challenging but it would be a large plus to understand the scalability of the method. * Table 3/4 show PPL results and draw conclusions based on the difference between PPLs in the range of ~10 and later on in ~7. This assumes that PPL is a linear measure but it is in fact an exponential quantity based on cross entropy. Therefore, the actual entropy difference between PPL 11 and PPL 10 is fa
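The reviewer's point about perplexity can be checked numerically. Since PPL = exp(cross entropy), a gap in PPL corresponds to a cross-entropy gap of ln(PPL_a) − ln(PPL_b), which depends on where on the scale the gap sits. The sketch below compares a one-point gap near PPL 10 with a one-point gap near PPL 7. The specific pairs (11, 10) and (8, 7) are illustrative choices, not figures taken from the paper's tables.

```python
import math


def cross_entropy_from_ppl(ppl: float) -> float:
    """Convert perplexity to cross entropy in nats: CE = ln(PPL)."""
    if ppl <= 0:
        raise ValueError("perplexity must be positive")
    return math.log(ppl)


def cross_entropy_gap(ppl_a: float, ppl_b: float) -> float:
    """Cross-entropy difference (nats) between two perplexity values."""
    return cross_entropy_from_ppl(ppl_a) - cross_entropy_from_ppl(ppl_b)


gap_near_10 = cross_entropy_gap(11.0, 10.0)  # ln(1.1), about 0.0953 nats
gap_near_7 = cross_entropy_gap(8.0, 7.0)     # ln(8/7), about 0.1335 nats

print(f"PPL 11 vs 10: {gap_near_10:.4f} nats")
print(f"PPL  8 vs  7: {gap_near_7:.4f} nats")
```

The same one-point PPL gap is therefore a larger cross-entropy gap at lower perplexity, so PPL differences taken from different ranges are not directly comparable.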
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: ALIGN
