Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions
Yanna Ding, Songtao Lu, Yingdong Lu, Tomasz Nowicki, Jianxi Gao

TL;DR
This paper analyzes what transformer models can and cannot do when learning Markovian dynamical functions in context. It shows that recovering the optimal parameters is NP-hard in general and interprets multilayer linear self-attention as preconditioned gradient descent.
Contribution
The paper gives a closed-form global minimizer for single-layer linear self-attention (LSA), proves that recovering transformer parameters that realize this minimizer is NP-hard in general, and offers a new interpretation of multilayer LSA as preconditioned gradient descent.
Findings
Closed-form global minimizer for single-layer LSA
Parameter recovery is NP-hard in general
Multilayer LSA acts as preconditioned gradient descent
Abstract
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a…
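For readers new to linear self-attention, the following is a minimal sketch of a single LSA layer acting on an in-context prompt. It uses the i.i.d. linear regression setting that the abstract describes as the focus of earlier work, not the Markovian setup studied in this paper. The parameterization (merged matrices `W_pv` and `W_kq`, a masked query column, normalization by the number of examples) is one common choice in the ICL literature and is an assumption here; the paper's exact formulation may differ. The function names and the step size `eta` are hypothetical. With the hand-picked weights below, the layer's prediction coincides with one plain gradient descent step on the least-squares loss, which gives some intuition for the "attention as gradient descent" reading.

```python
import numpy as np

def build_prompt(X, y, x_query):
    """Stack n labelled examples and one query into a (d+1) x (n+1) prompt.

    Columns 0..n-1 hold (x_i, y_i); the last column holds (x_query, 0).
    """
    top = np.column_stack([X.T, x_query])   # shape (d, n+1)
    bottom = np.append(y, 0.0)              # shape (n+1,), query label unknown
    return np.vstack([top, bottom])

def lsa_layer(Z, W_pv, W_kq):
    """One linear self-attention layer (no softmax), assumed form:

    Z_out = Z + W_pv Z M (Z^T W_kq Z) / n, where M masks the query column.
    """
    n = Z.shape[1] - 1
    mask = np.eye(n + 1)
    mask[n, n] = 0.0                        # the query does not act as a value
    return Z + W_pv @ Z @ mask @ (Z.T @ W_kq @ Z) / n

# Toy i.i.d. linear regression task (illustrative data only).
rng = np.random.default_rng(0)
n, d, eta = 8, 3, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true
x_query = rng.normal(size=d)
Z = build_prompt(X, y, x_query)

# Hand-picked weights: W_kq compares inputs, W_pv writes into the label row.
W_kq = np.zeros((d + 1, d + 1))
W_kq[:d, :d] = np.eye(d)
W_pv = np.zeros((d + 1, d + 1))
W_pv[d, d] = eta

# The prediction is read from the label slot of the query column.
pred_lsa = lsa_layer(Z, W_pv, W_kq)[d, n]

# One gradient descent step from w = 0 on (1 / 2n) * sum_i (w^T x_i - y_i)^2.
w_gd = (eta / n) * X.T @ y
pred_gd = w_gd @ x_query

print(np.isclose(pred_lsa, pred_gd))
```

The weights here are fixed by hand purely to make the correspondence visible. The paper's results concern the global minimizer of the training loss and the hardness of recovering parameters that realize it, which this sketch does not attempt to reproduce.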
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing
