Do language models plan ahead for future tokens?
Wilson Wu, John X. Morris, Lionel Levine

TL;DR
This paper investigates whether transformer language models plan ahead during inference by testing hypotheses about how future information is prepared and stored, using novel training schemes and synthetic data experiments.
Contribution
It introduces myopic training to test if transformers pre-cache future features or use breadcrumbs, providing evidence for pre-caching and insights into model planning mechanisms.
Findings
Evidence supports pre-caching in synthetic data experiments.
Pre-caching increases with model scale in language modeling.
Results suggest transformers may use pre-caching or breadcrumbs for future token prediction.
Abstract
Do transformers "think ahead" during inference at a given position? It is known transformers prepare information in the hidden states of the forward pass at time step t that is then used in future forward passes t + τ. We posit two explanations for this phenomenon: pre-caching, in which off-diagonal gradient terms present during training result in the model computing features at t irrelevant to the present inference task but useful for the future, and breadcrumbs, in which features most relevant to time step t are already the same as those that would most benefit inference at time t + τ. We test these hypotheses by training language models without propagating gradients to past timesteps, a scheme we formalize as myopic training. In a constructed synthetic data setting, we find clear evidence for pre-caching. In the autoregressive language modeling setting, our experiments…
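The "off-diagonal gradient terms" mentioned in the abstract can be illustrated with a toy sketch. It is not the paper's implementation; the scalar model, the function names, and the trick of giving each position its own copy of the shared parameter are assumptions made only for illustration. Entry `g[i][j]` approximates the derivative of the loss at position j with respect to the parameter copy used at position i. The ordinary gradient sums every entry, while a myopic-style gradient keeps only the diagonal, so no loss at a later position influences computation at an earlier one.

```python
def per_position_losses(thetas, xs, ys):
    """Toy causal model (hypothetical, for illustration only).

    Position j uses its own parameter copy thetas[j] and reads the hidden
    states written by positions 0..j, mimicking causal attention with
    uniform weights.
    """
    hidden = [th * x for th, x in zip(thetas, xs)]
    losses = []
    for j in range(len(xs)):
        context = sum(hidden[: j + 1]) / (j + 1)  # average over the causal prefix
        pred = thetas[j] * context
        losses.append((pred - ys[j]) ** 2)
    return losses


def gradient_matrix(theta, xs, ys, eps=1e-6):
    """Return g where g[i][j] ~ d loss_j / d theta_i (central differences).

    All copies are tied to the same value theta; only copy i is perturbed.
    """
    n = len(xs)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        plus = [theta] * n
        minus = [theta] * n
        plus[i] += eps
        minus[i] -= eps
        loss_plus = per_position_losses(plus, xs, ys)
        loss_minus = per_position_losses(minus, xs, ys)
        for j in range(n):
            g[i][j] = (loss_plus[j] - loss_minus[j]) / (2 * eps)
    return g


def full_and_myopic_gradients(theta, xs, ys):
    """Full gradient sums all entries; the myopic one keeps only i == j."""
    g = gradient_matrix(theta, xs, ys)
    n = len(xs)
    full = sum(g[i][j] for i in range(n) for j in range(n))
    myopic = sum(g[i][i] for i in range(n))
    return full, myopic, g


if __name__ == "__main__":
    full, myopic, g = full_and_myopic_gradients(0.5, [1.0, 2.0, 3.0], [1.0, 0.0, 2.0])
    print("full gradient:  ", full)
    print("myopic gradient:", myopic)
    print("off-diagonal contribution:", full - myopic)
```

In this toy, entries with i > j vanish because of causality, and the gap between the two gradients is exactly the sum of the entries with i < j. Those are the terms through which a future loss could push an earlier position to compute something it does not need for its own prediction.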
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
