Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency
Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, Han Liu

TL;DR
This paper explores the theoretical limits of prompt tuning in transformer models, demonstrating universality, capacity bounds, and computational efficiency constraints, with implications for designing effective prompt tuning methods.
Contribution
It establishes the universality and efficiency bounds of prompt tuning on single-head transformers with one self-attention layer, including lower bounds and phase transition phenomena.
Findings
Prompt tuning on simple transformers is universal for Lipschitz functions.
Exponential lower bounds on the number of soft-prompt tokens needed for memorization.
Existence of almost-linear time prompt tuning algorithms under certain conditions.
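To make the setting behind these findings concrete, here is a minimal sketch of prompt tuning on a single-head, single-layer attention model. It is an illustration under stated assumptions, not the paper's construction: the names `attention_layer`, `prompt_tuned_forward`, `frozen`, and the dimensions `d`, `L`, `L_p` are hypothetical, the residual connection and the omission of score scaling are assumed choices, and the weights are random rather than pretrained.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_layer(Z, W_q, W_k, W_v):
    # Single-head self-attention over the rows (tokens) of Z, plus a residual.
    # Assumption: no 1/sqrt(d) scaling and no feed-forward block.
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    return Z + softmax(Q @ K.T) @ V

def prompt_tuned_forward(X, P, frozen):
    # Prepend the soft prompt P to the input X, run the frozen layer,
    # and keep only the outputs at the input positions.
    Z = np.concatenate([P, X], axis=0)
    return attention_layer(Z, *frozen)[P.shape[0]:]

rng = np.random.default_rng(0)
d, L, L_p = 4, 5, 3  # hypothetical sizes: embedding dim, input length, prompt length
frozen = tuple(rng.normal(size=(d, d)) for _ in range(3))  # W_q, W_k, W_v stay fixed
X = rng.normal(size=(L, d))
P = rng.normal(size=(L_p, d))  # the only quantity prompt tuning would update
Y = prompt_tuned_forward(X, P, frozen)
print(Y.shape)  # (5, 4)
```

In this sketch only `P` would be trained; the universality and memorization findings concern how expressive the map from `X` to `Y` can become as `P` (and its length `L_p`) varies while `frozen` is held fixed.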
Abstract
We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are that prompt tuning on single-head transformers with only a single self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-dL and -in-(1/ε) lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the soft-prompt-induced keys and queries, and provide an…
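The quantity that the abstract ties to the phase transition, the norm of the soft-prompt-induced keys and queries, can be illustrated with a short sketch. The specific norm and the threshold are not stated in this excerpt, so the largest absolute entry used below is an assumption, and `qk_max_entry` is a hypothetical helper name.

```python
import numpy as np

def qk_max_entry(X, P, W_q, W_k):
    # Queries and keys of the prompt-augmented sequence [P; X].
    # Assumption: the "norm" is taken as the largest absolute entry.
    Z = np.concatenate([P, X], axis=0)
    Q, K = Z @ W_q, Z @ W_k
    return max(np.abs(Q).max(), np.abs(K).max())

rng = np.random.default_rng(0)
d, L, L_p = 4, 6, 2  # hypothetical sizes: embedding dim, input length, prompt length
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
X = np.zeros((L, d))  # a zero input isolates the prompt's contribution
P = rng.normal(size=(L_p, d))

base = qk_max_entry(X, P, W_q, W_k)
scaled = qk_max_entry(X, 10.0 * P, W_q, W_k)  # a larger prompt gives larger keys and queries

# Exact attention forms one score per (query, key) pair, hence the quadratic cost.
n_scores = (L + L_p) ** 2
```

Because the keys and queries are linear in the prompt-augmented sequence, scaling the soft prompt scales this quantity proportionally, so the choice of soft prompt directly affects which side of the efficiency criterion a prompt tuning instance falls on.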
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper advances the theory of prompt tuning by proving that even the simplest transformer architectures—with a single attention head and a single self-attention layer—can universally approximate any Lipschitz continuous sequence-to-sequence function. 2. The paper notably discovers a phase transition in computational efficiency based on the norms of the soft-prompt-induced keys and queries. It provides a criterion under which prompt tuning can be performed in sub-quadratic time and demons
1. The paper contains numerous grammatical errors, and some sentences are poorly constructed. A thorough proofreading is required to enhance the readability of the text. 2. While the paper brings new applications of previous concepts to function approximation, the technical novelty appears limited, with many proofs in Section 2 overlapping significantly with those used in "Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?" and Universality an
Overall, given the theoretical nature of the paper, it is a dense read by design, but the good presentation has helped a lot by clearly outlining the different contributions (e.g., universality results, memorization properties and the computational time analysis). The paper is quite solid in terms of the theoretical contributions. The contributions are outlined cleanly and the flow is great. The theoretical results also represent a significant contribution over existing works, such as [1]. Refe
To begin with my feedback, I'd like to mention that while I have a grasp of the high-level ideas and contributions of the paper, I have not checked the math and derivations in detail, although they seem reasonable. I am not an expert in this area, so I might not have sufficient knowledge to judge the merits and significance of the work w.r.t. the previous works, so I defer the assessment of these parts to other reviewers and the AC. I understand the paper outlines many "there exists" type
The paper is technically strong. It provides several improvements over previous results on the topic. While I did not verify every single proof, the overall argument seems sound and the conclusion plausible. The paper is fairly well-written for such a technically involved paper, and I believe the results in this paper will be useful for future work studying the approximation properties of transformers.
Overall, while the paper is technically solid, I have a hard time seeing it being impactful for communities beyond those studying the representational power of transformers. I consider myself to be reasonably well-informed about the development of deep learning theory. These kinds of approximation results are certainly nice to know, but they rarely make any impact in practice or even inform any algorithmic design. For example, the authors claim that: > These fundamental limits provide importa
Taxonomy
Topics: Sensor Technology and Measurement Systems · Advanced Electrical Measurement Techniques · Magnetic Field Sensors Techniques
