Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

Yanna Ding; Songtao Lu; Yingdong Lu; Tomasz Nowicki; Jianxi Gao

arXiv:2510.18638 · cs.LG · November 19, 2025


Open Access

TL;DR

This paper analyzes what linear self-attention (LSA) transformers can and cannot learn in context about Markovian dynamical functions: it derives the global minimizer for a single layer, shows that recovering parameters that realize it is NP-hard in general, and interprets multilayer LSA as preconditioned gradient descent.

Contribution

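It provides a closed-form expression for the global minimizer (in an enlarged parameter space) of a single-layer linear self-attention (LSA) model, proves that recovering transformer parameters that realize this minimizer is NP-hard in general, and offers a new interpretation of multilayer LSA as preconditioned gradient descent.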

Findings

01 Closed-form global minimizer for single-layer LSA

02 Parameter recovery is NP-hard in general

03 Multilayer LSA acts as preconditioned gradient descent
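
The third finding can be made concrete with a minimal NumPy sketch of the standard correspondence in the simplest setting: least-squares regression on in-context pairs, not the Markovian setup the paper studies. One preconditioned gradient step from zero weights gives the same query prediction as a softmax-free attention readout whose scores are the bilinear form x_q^T P x_i. The function names, the preconditioner P, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def preconditioned_gd_step(X, y, P):
    """One preconditioned gradient step on L(w) = (1/2n) * ||X w - y||^2 from w = 0.

    P is a hypothetical (d x d) preconditioner; P = eta * I gives plain gradient descent.
    """
    n = X.shape[0]
    grad_at_zero = -(X.T @ y) / n   # gradient of L evaluated at w = 0
    return -P @ grad_at_zero        # w_1 = w_0 - P @ grad, with w_0 = 0


def linear_attention_readout(X, y, x_query, P):
    """Softmax-free attention: score_i = x_query^T P x_i, value_i = y_i, averaged over n."""
    n = X.shape[0]
    scores = X @ P.T @ x_query      # entry i equals x_query^T P x_i
    return float(scores @ y) / n


# Toy in-context prompt: n input-output pairs and one query (assumed data).
rng = np.random.default_rng(0)
n, d = 16, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
x_query = rng.normal(size=d)

# An arbitrary symmetric positive-definite preconditioner, chosen for illustration.
A = rng.normal(size=(d, d))
P = A @ A.T + np.eye(d)

pred_gd = float(x_query @ preconditioned_gd_step(X, y, P))
pred_attn = linear_attention_readout(X, y, x_query, P)
print(np.isclose(pred_gd, pred_attn))
```

Both predictions reduce to x_q^T P X^T y / n, so they agree up to floating-point error for any choice of P. Stacking several such layers is what motivates reading multilayer LSA as repeated gradient-type updates, though the exact form used in the paper may differ from this sketch.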

Abstract

Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a…


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing