Improving Next Tokens via Second-to-Last Predictions with Generate and Refine

Johannes Schneider

arXiv:2411.15661 · cs.CL · February 17, 2025

TL;DR

This paper trains a model to predict the second-to-last token of a sequence, which it does with more than 15% higher accuracy than standard next-token prediction, and combines that model with a standard GPT in a generate-then-refine scheme to improve next-token predictions.

Contribution

The paper presents a novel decoder-only model trained for second-to-last token prediction and a generate-then-refine approach that uses this model to improve GPT's next-token accuracy.
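As a rough illustration of what a second-to-last prediction task could look like, the sketch below builds training pairs in which the second-to-last token of every prefix is hidden while the last token stays visible. The `MASK` symbol, the function name, and the input layout are assumptions made for this example; the paper's actual input format is not specified on this page.

```python
from typing import List, Tuple

MASK = "<mask>"  # hypothetical placeholder symbol, not taken from the paper


def second_to_last_examples(
    tokens: List[str], min_len: int = 3
) -> List[Tuple[List[str], str]]:
    """Build (input, target) pairs, hiding the second-to-last token of each prefix.

    The masked position is fixed by the prefix length, so no random
    sampling of mask positions is involved.
    """
    examples = []
    for end in range(min_len, len(tokens) + 1):
        window = tokens[:end]
        target = window[-2]                            # token to be predicted
        masked = window[:-2] + [MASK, window[-1]]      # last token stays visible
        examples.append((masked, target))
    return examples


pairs = second_to_last_examples(["the", "cat", "sat", "down"])
for masked, target in pairs:
    print(masked, "->", target)
```

Because the hidden position is determined by the prefix alone, every prefix of at least `min_len` tokens yields exactly one training pair in this sketch.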

Findings

01  Second-to-last token predictions are over 15% more accurate than next token predictions.

02  The generate-then-refine method yields consistent improvements in next-token prediction accuracy.

03  The approach offers higher training efficiency compared to BERT-style models.

Abstract

Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second-to-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach to masking tokens. We use our model to improve the next token predictions of a standard GPT by combining both predictions in a "generate-then-refine" approach. We demonstrate on different variants of GPT-2 and different datasets that (not unexpectedly) second-to-last token predictions are much more accurate, i.e., more than 15% higher accuracy than standard next token predictions. The "generate-then-refine" approach also demonstrates notable improvements in next-token predictions, yielding…
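One possible reading of the generate-then-refine idea can be sketched as a reranking step: take the top candidates for the next token, generate one further token after each candidate, and then ask a second-to-last model how plausible each candidate is given that lookahead. Everything below is an assumption for illustration only: the toy probability tables, the `top_k` and `weight` parameters, and the log-linear mixing rule are hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, Dict, List

Dist = Dict[str, float]

# Toy stand-ins for the two models (hypothetical numbers, illustration only).
NEXT_TABLE = {
    ("the",): {"dog": 0.55, "cat": 0.45},
    ("the", "dog"): {"barked": 0.9, "sat": 0.1},
    ("the", "cat"): {"sat": 0.9, "barked": 0.1},
}
REFINE_TABLE = {
    (("the",), "barked"): {"dog": 0.6, "cat": 0.4},
    (("the",), "sat"): {"cat": 0.95, "dog": 0.05},
}


def toy_next(context: List[str]) -> Dist:
    """Next-token distribution given the context."""
    return NEXT_TABLE[tuple(context)]


def toy_second_to_last(context: List[str], last: str) -> Dist:
    """Distribution over the hidden token between `context` and `last`."""
    return REFINE_TABLE[(tuple(context), last)]


def generate_then_refine(
    context: List[str],
    next_model: Callable[[List[str]], Dist],
    refine_model: Callable[[List[str], str], Dist],
    top_k: int = 2,
    weight: float = 0.5,
) -> str:
    """Rerank the top-k next-token candidates using a second-to-last model."""
    next_dist = next_model(context)
    candidates = sorted(next_dist, key=next_dist.get, reverse=True)[:top_k]
    scores = {}
    for cand in candidates:
        # Generate: greedily pick one lookahead token after the candidate.
        follow = next_model(context + [cand])
        lookahead = max(follow, key=follow.get)
        # Refine: the candidate is now the second-to-last token.
        p_refine = refine_model(context, lookahead).get(cand, 1e-9)
        scores[cand] = (1 - weight) * math.log(next_dist[cand]) + weight * math.log(p_refine)
    return max(scores, key=scores.get)


greedy = max(toy_next(["the"]), key=toy_next(["the"]).get)
refined = generate_then_refine(["the"], toy_next, toy_second_to_last)
print("greedy:", greedy, "| refined:", refined)
```

In this toy setup the greedy next-token choice and the refined choice differ, which shows how information from a lookahead token could change the ranking; whether the paper combines the two predictions this way is not stated in the truncated abstract.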

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression

Methods: Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Byte Pair Encoding · Adam · Residual Connection · Weight Decay · Softmax · Attention Is All You Need