Kinetic theory for Transformers and the lost-in-the-middle phenomenon
Mitia Duerinckx, Borjan Geshkovski, Stefano Rossi

TL;DR
This paper models causal self-attention in Transformers as a particle system, deriving a mean-field limit and explaining the 'lost-in-the-middle' phenomenon through rigorous correlation analysis.
Contribution
It introduces a novel particle system framework for causal self-attention and provides a quantitative analysis of the 'lost-in-the-middle' effect in token retrieval.
Findings
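Proved a quantitative mean-field limit for causal self-attention dynamics.
Solved the limiting correlation equation in closed form for iid uniformly distributed tokens.
Gave a rigorous explanation of the 'lost-in-the-middle' phenomenon: the token retrieval profile shows primacy, recency, and a unique interior minimum under an explicit smallness condition.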
Abstract
We study causal self-attention dynamics -- a toy model for decoder Transformers -- which we interpret as a non-exchangeable interacting particle system. Adapting cumulant expansions to the triangular causal dependency structure of the model, and appealing to non-hierarchical methods to estimate correlations using Glauber calculus, we prove a quantitative mean-field limit result and a next-order characterization of correlations. For iid uniformly distributed tokens, the limiting correlation equation can be solved in closed form and we obtain a rigorous explanation of the empirically observed 'lost-in-the-middle' phenomenon: the token retrieval profile, as a function of the source position in the prompt, is U-shaped, with primacy, recency, and a unique interior minimum under an explicit smallness condition.
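To make the "interacting particle system" reading of causal self-attention concrete, here is a minimal Python sketch. The abstract does not state the model's equations, so everything specific here is an assumption for illustration: tokens are treated as points on the unit sphere, token i moves toward a softmax-weighted average of tokens j <= i (the triangular causal structure), and the names `causal_attention_step`, `simulate`, `beta`, and `dt` are hypothetical. It is not the authors' exact dynamics.

```python
import numpy as np

def causal_attention_step(x, beta=1.0, dt=0.1):
    """One explicit Euler step of an ASSUMED causal self-attention flow on the unit sphere."""
    n, _ = x.shape
    scores = beta * (x @ x.T)                        # pairwise inner products <x_i, x_j>
    mask = np.tril(np.ones((n, n), dtype=bool))      # causal mask: token i sees only j <= i
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax over visible tokens
    drift = weights @ x                              # attention-weighted average of tokens
    # Project the drift onto the tangent space of the sphere at each x_i.
    drift -= np.sum(drift * x, axis=1, keepdims=True) * x
    x_new = x + dt * drift
    # Renormalize so every token stays on the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)

def simulate(n=16, d=3, steps=50, beta=1.0, dt=0.1, seed=0):
    """Evolve n iid uniform tokens on the sphere; returns (initial, final) positions."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)    # normalized Gaussians are uniform on the sphere
    x0 = x.copy()
    for _ in range(steps):
        x = causal_attention_step(x, beta=beta, dt=dt)
    return x0, x

if __name__ == "__main__":
    x0, x = simulate()
    # How far each token moved, indexed by its position in the prompt.
    print(np.linalg.norm(x - x0, axis=1))
```

Under these assumptions the first token attends only to itself, so its projected drift vanishes and it stays fixed, while later tokens depend on all earlier ones. This position dependence is what makes the system non-exchangeable.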
