TL;DR
This paper provides a theoretical and empirical analysis of in-context learning with transformer models in non-stationary environments, highlighting the advantages of gated linear attention for adapting to evolving tasks.
Contribution
It introduces a theoretical framework for understanding in-context learning (ICL) in non-stationary settings and demonstrates the benefits of gating mechanisms over standard attention.
Findings
Gated linear attention (GLA) adapts effectively to changing input-output relationships.
GLA achieves lower training and testing errors than standard attention in non-stationary regression tasks.
Empirical results validate the theoretical advantages of gating mechanisms in dynamic environments.
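The intuition behind these findings can be sketched with a toy experiment. The code below is not the paper's model or experimental setup: it is a minimal, hand-built illustration that assumes a fixed scalar gate `gamma` (GLA gates are typically learned and data-dependent), a two-dimensional regression task whose weight vector rotates over time, and a one-step estimator in which the attention state accumulates `y_i * x_i`. All function names and parameter values are hypothetical. Setting `gamma = 1.0` recovers ungated linear attention, which averages over the entire history, while `gamma < 1.0` down-weights older examples.

```python
import numpy as np


def gated_linear_attention_predictions(X, y, gamma):
    """Predict y[t] from earlier examples using a scalar-gated linear-attention state.

    State update (assumed toy form): S_t = gamma * S_{t-1} + y_t * x_t.
    The prediction for x_t uses only the state built from examples before t,
    normalized by the running sum of gate weights Z.
    """
    n, d = X.shape
    S = np.zeros(d)  # gated running sum of y_i * x_i
    Z = 0.0          # gated running count of examples
    preds = np.zeros(n)
    for t in range(n):
        if Z > 0.0:
            preds[t] = X[t] @ (S / Z)
        S = gamma * S + y[t] * X[t]
        Z = gamma * Z + 1.0
    return preds


def make_drifting_task(n=400, seed=0):
    """Noise-free regression whose weight vector rotates by pi radians over n steps."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, np.pi, n)
    W = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (n, 2)
    X = rng.standard_normal((n, 2))
    y = np.sum(W * X, axis=1)  # y_t = w_t . x_t
    return X, y


X, y = make_drifting_task()
mse_ungated = np.mean((gated_linear_attention_predictions(X, y, 1.0) - y) ** 2)
mse_gated = np.mean((gated_linear_attention_predictions(X, y, 0.95) - y) ** 2)
print(f"ungated MSE: {mse_ungated:.3f}, gated MSE: {mse_gated:.3f}")
```

In this sketch the gate trades variance for bias: a smaller `gamma` forgets stale examples faster but estimates the current weights from fewer effective samples, so the best value depends on how quickly the task drifts.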
Abstract
Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, an assumption that overlooks a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and…