TL;DR
This paper formalizes token-importance estimation as an in-context learning problem and shows, through an explicit theoretical construction and experiments, that transformers can solve it by implementing Mirror Descent.
Contribution
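It introduces a framework based on Mixture of Transition Distributions, in which a latent variable determines how past tokens influence the next one, and shows that transformers can implement Mirror Descent to learn the unobserved mixture weights in context.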
Findings
Transformers can implement Mirror Descent to learn, from the context alone, the mixture weights that encode token importance.
A three-layer transformer can exactly perform one step of Mirror Descent.
Empirical results show trained transformers align with the theoretical model.
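To make "one step of Mirror Descent" concrete, here is a minimal sketch of a single update on the probability simplex. It assumes the negative-entropy mirror map, which turns the update into a multiplicative (exponentiated-gradient) rule; the summary above does not say which mirror map the paper uses, so this choice, along with the names `mirror_descent_step`, `weights`, `grad`, and `step_size`, is an illustrative assumption and not the paper's construction.

```python
import math


def mirror_descent_step(weights, grad, step_size):
    """One entropic Mirror Descent step on the probability simplex.

    Assumption: negative-entropy mirror map, so the update is
    w_i <- w_i * exp(-step_size * grad_i), followed by renormalization.
    """
    exponents = [-step_size * g for g in grad]
    # Shift by the largest exponent for numerical stability; the shift
    # cancels out in the normalization below.
    shift = max(exponents)
    unnormalized = [w * math.exp(e - shift) for w, e in zip(weights, exponents)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical example: three candidate past positions, uniform initial weights.
uniform = [1.0 / 3.0] * 3
updated = mirror_descent_step(uniform, grad=[0.0, 1.0, 2.0], step_size=1.0)
```

Because the update is multiplicative and renormalized, the weights remain a valid probability vector, and coordinates with a larger gradient lose mass relative to the others.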
Abstract
Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is…
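The setup described in the abstract can be sketched in a few lines. The code below assumes the classical Mixture of Transition Distributions form, in which the next-token distribution is a weighted sum of one shared transition matrix applied to each of the last few tokens; the paper's exact parameterization may differ. All names (`mtd_next_token_probs`, `sample_sequence`, `nll_gradient`) and the toy transition matrix are hypothetical.

```python
import random


def mtd_next_token_probs(context, weights, transition):
    """P(next = j) = sum over lags g of weights[g] * transition[context[-1 - g]][j].

    weights[0] is the weight of the most recent token.
    """
    vocab = len(transition)
    probs = [0.0] * vocab
    for lag, w in enumerate(weights):
        past = context[-1 - lag]
        for j in range(vocab):
            probs[j] += w * transition[past][j]
    return probs


def sample_sequence(length, weights, transition, rng):
    """Sample a token sequence from the assumed MTD model."""
    order, vocab = len(weights), len(transition)
    seq = [rng.randrange(vocab) for _ in range(order)]  # arbitrary prefix
    while len(seq) < length:
        probs = mtd_next_token_probs(seq, weights, transition)
        seq.append(rng.choices(range(vocab), weights=probs)[0])
    return seq


def nll_gradient(seq, weights, transition):
    """Gradient of the average negative log-likelihood w.r.t. the mixture weights.

    Assumes every transition entry is strictly positive, so the mixture
    probability in the denominator never vanishes.
    """
    order = len(weights)
    grad = [0.0] * order
    count = 0
    for t in range(order, len(seq)):
        contribs = [transition[seq[t - 1 - lag]][seq[t]] for lag in range(order)]
        mix = sum(w * c for w, c in zip(weights, contribs))
        for lag in range(order):
            grad[lag] -= contribs[lag] / mix
        count += 1
    return [g / count for g in grad]


# Toy example: two tokens, and the token two steps back matters most.
transition = [[0.9, 0.1], [0.1, 0.9]]
true_weights = [0.25, 0.75]
seq = sample_sequence(200, true_weights, transition, random.Random(0))
grad_at_uniform = nll_gradient(seq, [0.5, 0.5], transition)
```

Here the mixture weights play the role of the unobserved quantity that must be inferred from the context, and a gradient of this kind is what a Mirror Descent update on the weights would consume.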