Towards Infinite-Long Prefix in Transformer
Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang

TL;DR
This paper introduces a theoretically grounded method for efficient prefix tuning in transformers. It approximates an arbitrarily long prefix with a small number of trainable parameters and reports competitive performance across vision, language, and math tasks.
Contribution
It provides a convergence guarantee for ultra-long prefix training under the NTK framework and proposes a practical algorithm that approximates prefix attention to within a polynomially small error.
Findings
Achieves superior or competitive performance on vision, language, and math tasks.
Requires only a few extra trainable parameters instead of an infinite-long prefix.
Demonstrates the effectiveness of the method through preliminary experiments.
Abstract
Prompting and context-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks. They are empirically efficient and effective, matching the performance of full parameter fine-tuning, but the theoretical understandings are limited. In this paper, we aim to address this limitation by studying their ability from the perspective of prefix length. In particular, we provide a convergence guarantee for training an ultra-long prefix in a stylized setting using the Neural Tangent Kernel (NTK) framework. Based on this strong theoretical guarantee, we design and implement an algorithm that only needs to introduce and fine-tune a few extra trainable parameters instead of an infinite-long prefix in each layer of a transformer, and can approximate the prefix attention to a guaranteed polynomial-small error.…
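The idea of replacing a long prefix with a few parameters can be illustrated with a small numerical sketch. The code below is not the paper's algorithm: it assumes a hypothetical degree-2 Taylor feature map `phi` in place of whatever approximation the authors use, and all function names (`phi`, `compress_prefix`, `compressed_attention`) are invented for illustration. It folds a prefix of `m` key/value pairs into a matrix `Z` and a vector `kvec` whose sizes depend only on the head dimension `d`, then compares the result with exact softmax attention over prefix plus context for a single query.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # Assumed feature map (illustrative): phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
    # the degree-2 Taylor expansion of exp(q.k). Accurate only when |q.k| is small.
    d = len(x)
    quad = [x[i] * x[j] / math.sqrt(2.0) for i in range(d) for j in range(d)]
    return [1.0] + list(x) + quad

def exact_prefix_attention(q, prefix_k, prefix_v, ctx_k, ctx_v):
    # Softmax attention over [prefix; context] for one query (1/sqrt(d) scaling omitted).
    keys, vals = prefix_k + ctx_k, prefix_v + ctx_v
    w = [math.exp(dot(q, k)) for k in keys]
    denom = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, vals)) / denom
            for c in range(len(vals[0]))]

def compress_prefix(prefix_k, prefix_v):
    # Fold m prefix pairs into Z (r x d) and kvec (r), with r = 1 + d + d^2.
    feats = [phi(k) for k in prefix_k]
    r, d = len(feats[0]), len(prefix_v[0])
    Z = [[sum(f[a] * v[c] for f, v in zip(feats, prefix_v)) for c in range(d)]
         for a in range(r)]
    kvec = [sum(f[a] for f in feats) for a in range(r)]
    return Z, kvec

def compressed_attention(q, Z, kvec, ctx_k, ctx_v):
    # Context tokens use exact exp weights; the prefix enters only through Z and kvec.
    fq = phi(q)
    w = [math.exp(dot(q, k)) for k in ctx_k]
    d = len(ctx_v[0])
    num = [sum(wi * v[c] for wi, v in zip(w, ctx_v))
           + sum(fq[a] * Z[a][c] for a in range(len(fq)))
           for c in range(d)]
    denom = sum(w) + dot(fq, kvec)
    return [x / denom for x in num]

# Toy data (arbitrary sizes): small-norm queries and keys keep the Taylor error low.
random.seed(0)
d, m, n = 4, 1000, 8

def rand_vec(scale):
    return [random.uniform(-scale, scale) for _ in range(d)]

prefix_k = [rand_vec(0.2) for _ in range(m)]
prefix_v = [rand_vec(1.0) for _ in range(m)]
ctx_k = [rand_vec(0.2) for _ in range(n)]
ctx_v = [rand_vec(1.0) for _ in range(n)]
q = rand_vec(0.2)

Z, kvec = compress_prefix(prefix_k, prefix_v)
exact = exact_prefix_attention(q, prefix_k, prefix_v, ctx_k, ctx_v)
approx = compressed_attention(q, Z, kvec, ctx_k, ctx_v)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
print("compressed shape:", len(Z), "x", len(Z[0]), "| max abs error:", max_err)
```

In this sketch the size of `Z` and `kvec` does not grow with the prefix length `m`, which is the property the abstract describes. In a fine-tuning setting one would presumably train `Z` and `kvec` directly rather than derive them from an explicit prefix; that step is an assumption here and is not shown.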
Peer Reviews
Decision · Submitted to ICLR 2025
* The proposed NTK-based analysis is novel and opens a new direction to understand prefix-tuning. Although the extremely-long prefix setup is not widely used yet, it seems to be an interesting direction for Transformer-based models.
* Is the "infinite-long" or "sufficiently long" prefix assumption practical (i.e., beneficial in terms of performance)? In Table 1, the m=200 case is worse than the m=100 case. The "Many-shot In-context Learning" paper (Agarwal et al., 2024) showed empirically that using many few-shot examples helps, but that paper increases the length of the input prompt rather than the number of trainable parameters, so its results would not map one-to-one onto this setting.
* Experimental results (Tables and Figures) should indicat
1. Strong Theoretical Foundation: The paper provides a rigorous theoretical analysis of prefix learning with ultra-long prefixes, leveraging the Neural Tangent Kernel (NTK) framework. This contributes a new perspective to understanding prefix-based methods in transformers, especially regarding convergence and scaling laws. 2. Efficient Fine-Tuning Method: The proposed NTK-Attention significantly reduces the number of trainable parameters and computational complexity by replacing the large prefix
1. Incomplete Evaluation Across Different Model Architectures: The experiments presented focus primarily on a few specific architectures, such as pretrained ViT and ChatGLM3-6B. The generalizability of NTK-Attention across a broader range of transformer architectures and model sizes remains unexplored, which limits the applicability of the findings to other models commonly used in different domains. 2. Limited Applicability of NTK-Attention’s Efficiency Claims: The reduced computational complexi
Theoretical Contribution: The paper provides a solid theoretical analysis based on NTK to understand the efficiency and convergence of Prefix Learning, which deepens the theoretical foundation of this area. Efficient Training: The NTK-Attention algorithm significantly reduces computational complexity and memory requirements by replacing ultra-long prefixes with a small number of additional trainable parameters. Experimental Validation: The authors validate the performance of NTK-Attention on div
Lack of Practical Applicability of Theoretical Analysis: While this paper presents an NTK-based analysis to establish the theoretical foundation of Prefix Learning, there is a lack of empirical evaluation regarding its applicability to large-scale models. In particular, there is little discussion on how the approach involving ultra-long prefixes can be practically applied in real-world industrial settings. Implementation Complexity of NTK-Attention: The NTK-Attention algorithm introduces a smal
Taxonomy
Topics: Power Systems and Technologies · Power Systems Fault Detection · Power Transformer Diagnostics and Insulation
Methods: Softmax · Attention Is All You Need
