Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction
Nicholas Walker

TL;DR
This paper introduces Future Token Prediction (FTP), a novel pretraining method for transformer models that predicts multiple future tokens, leading to more coherent and semantically meaningful embeddings and improved text and code generation.
Contribution
The paper proposes FTP, a new approach that generates per-token semantic vectors predicting multiple future tokens, enhancing topic coherence and semantic representation over standard next-token models.
Findings
FTP embeddings vary smoothly along text sequences.
FTP models produce more topic-coherent text.
FTP outperforms GPT in coding tasks.
Abstract
Causal decoder-only transformer models used for generative language modelling, such as Generative Pre-trained Transformers (GPT), are trained to predict the next token in a sequence based only on its previous tokens. Despite this simple training objective, they have proved to be powerful AI tools. However, only predicting the next token results in top layer embedding vectors that are highly token-focused. There may be benefits in generating embedding vectors at each token position that better capture the overall meaning of longer sequences of future text. Recent studies matching brain scans with deep language models suggest that humans also predict upcoming words when listening or reading but consider multiple future tokens rather than just one. This research investigates a new pretraining method called Future Token Prediction (FTP). In FTP, a large transformer encoder generates top…
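To make the contrast between next-token and multi-token objectives concrete, here is a minimal sketch of how per-position training targets could be built when each position looks ahead several tokens. It is an illustration only, not the paper's implementation: the function name `future_token_targets`, the `horizon` parameter, and the `pad_id` value are all assumptions introduced here, and the abstract above is truncated before it describes how FTP actually forms its predictions.

```python
from typing import List


def future_token_targets(tokens: List[int], horizon: int, pad_id: int = -1) -> List[List[int]]:
    """Build multi-token targets for each position (hypothetical helper).

    For position i, the target is the window tokens[i+1 : i+1+horizon].
    Windows that run past the end of the sequence are right-padded with
    pad_id so every position has exactly `horizon` targets.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    targets = []
    for i in range(len(tokens)):
        window = tokens[i + 1 : i + 1 + horizon]
        # Pad short windows near the end of the sequence.
        window = window + [pad_id] * (horizon - len(window))
        targets.append(window)
    return targets


# With horizon=1 this reduces to the usual next-token targets.
print(future_token_targets([10, 11, 12, 13], horizon=1))
# With horizon=2 each position is paired with its next two tokens.
print(future_token_targets([10, 11, 12, 13], horizon=2))
```

In a training loop, padded slots would typically be masked out of the loss so that positions near the end of a sequence are not penalised for tokens that do not exist.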
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Multi-Head Attention · Linear Warmup With Cosine Annealing · Adam · Softmax · Dropout · Byte Pair Encoding
