Improving Next Tokens via Second-to-Last Predictions with Generate and Refine
Johannes Schneider

TL;DR
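This paper introduces a method to improve next-token prediction in language models: a decoder-only model is trained to predict the second-to-last token of a sequence, and its predictions are combined with those of a standard GPT. Second-to-last token predictions reach more than 15% higher accuracy than standard next-token predictions, and the combined generate-then-refine approach improves next-token accuracy.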
Contribution
The paper presents a novel decoder-only model trained for second-to-last token prediction and a generate-then-refine approach that enhances GPT's next token accuracy.
Findings
Second-to-last token predictions are over 15% more accurate than next token predictions.
The generate-then-refine method yields consistent improvements in next-token prediction accuracy.
The approach offers higher computational training efficiency than BERT-style models because it masks tokens in a structured, deterministic way.
Abstract
Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second to last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach to masking tokens. We use our model to improve the next token predictions of a standard GPT by combining both predictions in a "generate-then-refine" approach. We demonstrate on different variants of GPT-2 and different datasets that (not unexpectedly) second to last token predictions are much more accurate, i.e., more than 15% higher accuracy than standard next token predictions. The "generate-then-refine" approach also demonstrates notable improvements in next-token predictions, yielding…
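The generate-then-refine idea can be illustrated with a small toy sketch. It is not the paper's implementation: the abstract does not specify how the two predictions are combined, so the weighted log-probability mix, the one-token lookahead, the function names (`refine_next_token`, `lookahead`, `second_to_last_prob`), and all probabilities below are assumptions chosen only for illustration. The sketch takes the top candidates of a next-token model, lets that model propose a following token for each candidate, and then asks a second-to-last-token model how plausible each candidate is given that following token.

```python
import math

# Toy stand-ins for real models; every number here is made up for illustration.
NEXT_PROBS = {"cat": 0.40, "dog": 0.35, "car": 0.25}          # next-token model output
FOLLOWING = {"cat": "meowed", "dog": "barked", "car": "drove"}  # hypothetical lookahead token
SECOND_TO_LAST = {                                              # second-to-last model output
    ("cat", "meowed"): 0.5,
    ("dog", "barked"): 0.9,
    ("car", "drove"): 0.3,
}


def lookahead(candidate):
    """Hypothetical: token the next-token model would generate after `candidate`."""
    return FOLLOWING[candidate]


def second_to_last_prob(candidate, following):
    """Hypothetical: probability of `candidate` in second-to-last position,
    given the token that follows it."""
    return SECOND_TO_LAST[(candidate, following)]


def refine_next_token(next_probs, lookahead_fn, refine_fn, k=3, weight=0.5):
    """Re-rank the top-k next-token candidates.

    Assumed combination rule: a weighted sum of log-probabilities from both
    models, renormalized with a softmax over the k candidates.
    """
    top_k = sorted(next_probs, key=next_probs.get, reverse=True)[:k]
    eps = 1e-12  # avoids log(0)
    scores = {}
    for cand in top_k:
        following = lookahead_fn(cand)
        generate_term = math.log(max(next_probs[cand], eps))
        refine_term = math.log(max(refine_fn(cand, following), eps))
        scores[cand] = (1 - weight) * generate_term + weight * refine_term
    # Numerically stable softmax over the combined scores.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {cand: math.exp(s - top) / total for cand, s in scores.items()}


refined = refine_next_token(NEXT_PROBS, lookahead, second_to_last_prob)
best = max(refined, key=refined.get)
print(best, refined)
```

In this toy setup the next-token model alone prefers "cat", while the refined distribution prefers "dog" because the second-to-last model rates it as far more plausible in context. With `weight=0` the sketch reduces to a renormalized top-k of the original next-token prediction.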
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression
Methods: Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Byte Pair Encoding · Adam · Residual Connection · Weight Decay · Softmax · Attention Is All You Need
