Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Francesco D'Angelo; Nicolas Flammarion

arXiv:2604.10848·cs.LG·April 14, 2026

TL;DR

This paper formalizes token-importance estimation as an in-context learning problem and shows, through an explicit construction and experiments, that transformers can solve it by implementing Mirror Descent.

Contribution

It introduces a framework based on Mixture of Transition Distributions, in which unobserved mixture weights govern how past tokens influence the next one, and demonstrates that transformers can implement Mirror Descent to learn these weights in-context.

Findings

01. Transformers can implement Mirror Descent to learn token importance.

02. A three-layer transformer can exactly perform one step of Mirror Descent.

03. Empirical results show trained transformers align with the theoretical model.

Abstract

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is…
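
To make the setup in the abstract concrete, the following is a minimal numerical sketch of a Mixture of Transition Distributions and a single Mirror Descent step on its mixture weights. It is not the paper's transformer construction. It assumes a known transition matrix `Q` shared across lags, the negative-entropy mirror map (which gives a multiplicative, exponentiated-gradient update on the simplex), and a hand-picked step size `eta`; the helper names `sample_mtd`, `lag_likelihoods`, `nll`, and `mirror_descent_step` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_mtd(Q, lam, T, rng):
    """Sample T tokens from an MTD model (assumed form: shared Q, K lags).

    A latent lag index k is drawn from lam, then the next token is drawn
    from the row of Q selected by the token k+1 positions back.
    """
    V, K = Q.shape[0], len(lam)
    x = list(rng.integers(V, size=K))      # arbitrary initial context
    for _ in range(T):
        k = rng.choice(K, p=lam)           # latent variable: which past token matters
        prev = x[-(k + 1)]
        x.append(rng.choice(V, p=Q[prev]))
    return np.array(x)

def lag_likelihoods(Q, x, K):
    """A[t, k] = Q[x_{t-k-1}, x_t]: likelihood of each token under each lag."""
    t = np.arange(K, len(x))
    return np.stack([Q[x[t - k - 1], x[t]] for k in range(K)], axis=1)

def nll(lam, A):
    """Average negative log-likelihood of the context under weights lam."""
    return -np.mean(np.log(A @ lam))

def mirror_descent_step(lam, A, eta):
    """One entropic Mirror Descent step on the probability simplex."""
    grad = -np.mean(A / (A @ lam)[:, None], axis=0)  # gradient of nll w.r.t. lam
    w = lam * np.exp(-eta * grad)                    # multiplicative update
    return w / w.sum()                               # renormalize onto the simplex

rng = np.random.default_rng(0)
V, K = 5, 3                                  # vocabulary size, number of lags (assumed)
Q = rng.dirichlet(np.full(V, 0.5), size=V)   # random row-stochastic transition matrix
lam_true = np.array([0.7, 0.2, 0.1])         # unobserved mixture weights
x = sample_mtd(Q, lam_true, T=2000, rng=rng)
A = lag_likelihoods(Q, x, K)

lam0 = np.full(K, 1.0 / K)                   # uniform initialization
lam1 = mirror_descent_step(lam0, A, eta=0.5)
print("before:", lam0, "nll:", nll(lam0, A))
print("after: ", lam1, "nll:", nll(lam1, A))
```

Because the update is multiplicative and then renormalized, the estimate stays on the simplex without any projection step. Repeating `mirror_descent_step` would refine the estimate further; the sketch stops at one step only to mirror the single-step setting mentioned in the abstract.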

Videos

Transformers Learn Latent Mixture Models In-Context via Mirror Descent · SlidesLive