Visual Prompt-Agnostic Evolution
Junze Wang, Lei Fan, Dezheng Zhang, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong

TL;DR
This paper introduces Prompt-Agnostic Evolution (PAE), a novel method to improve visual prompt tuning by modeling prompt dynamics, resulting in faster convergence and better accuracy across multiple vision tasks.
Contribution
We propose PAE, a frequency-domain and stability-inspired approach that enhances prompt evolution, accelerates training, and is compatible with various VPT methods without backbone changes.
Findings
PAE achieves 1.41x faster convergence on average.
PAE improves accuracy by 1-3% across 25 datasets.
PAE is prompt-agnostic, lightweight, and seamlessly integrates with existing VPT variants.
Abstract
Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution (PAE), which strengthens visual prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits…
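To make the prompt-insertion mechanism concrete, the following is a minimal sketch in plain Python, not the authors' code. It assumes the deep variant of VPT, in which each layer receives its own set of prompts that replace the prompt outputs of the previous layer. The names `make_prompts`, `frozen_layer`, and `vpt_deep_forward` are hypothetical, and `frozen_layer` is only a toy stand-in for a frozen ViT block.

```python
import random


def make_prompts(num_layers, num_prompts, dim, seed=0):
    """Create one set of small random prompt vectors per layer (hypothetical init)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]
        for _ in range(num_layers)
    ]


def frozen_layer(tokens):
    """Toy stand-in for a frozen transformer block: add the mean token to every token."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


def vpt_deep_forward(cls_token, patch_tokens, prompts):
    """Run the token sequence through all layers, inserting fresh prompts at each one."""
    num_patches = len(patch_tokens)
    hidden = [cls_token] + patch_tokens
    for layer_prompts in prompts:
        # Keep the CLS token and the patch tokens, drop any prompt outputs from
        # the previous layer, and insert this layer's own learnable prompts.
        patches = hidden[len(hidden) - num_patches:]
        sequence = [hidden[0]] + layer_prompts + patches
        hidden = frozen_layer(sequence)
    return hidden


# Example: 3 layers, 2 prompts per layer, 5 patch tokens of dimension 4.
prompts = make_prompts(num_layers=3, num_prompts=2, dim=4)
output = vpt_deep_forward([0.0] * 4, [[1.0] * 4 for _ in range(5)], prompts)
print(len(output))  # 8 tokens: 1 CLS + 2 prompts + 5 patches
```

In this setting only the `prompts` list would be trained; the stand-in layer has no trainable parameters, which mirrors the frozen backbone described in the abstract.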
Peer Reviews
Decision · ICLR 2026 Poster
1. Novel formulation: Reframing prompt tuning as a dynamical system using Koopman theory and Lyapunov stability is novel and mathematically grounded.
2. Comprehensive analysis: The paper clearly diagnoses VPT training instability through layer-wise gradient visualizations and supports it with quantitative results.
3. Strong empirical performance: PAE consistently improves various VPT baselines, showing strong generalization across tasks and benchmarks.
4. Prompt-agnostic applicability: The me
- Motivation clarity: While the dynamical-system framing is novel and interesting, the necessity of such complexity for solving gradient oscillation may be overstated. Simpler temporal regularization could have been compared. For example, could simpler smoothing (e.g., temporal moving average across layers) achieve comparable stability?
- Dependence on frequency bias: MPA relies on identifying “frequency shortcuts,” which may not exist or be stable in non-natural image domains, limiting transfe
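The simpler baseline raised under "Motivation clarity" can be sketched as follows. This is one possible reading of the reviewer's suggestion, not something the paper is reported to evaluate: an exponential moving average applied over depth to a list of per-layer prompt vectors. The function name `ema_across_layers` and the `decay` parameter are hypothetical.

```python
def ema_across_layers(layer_prompts, decay=0.9):
    """Smooth per-layer prompt vectors with an exponential moving average over depth.

    layer_prompts: list of vectors (lists of floats), one per layer.
    decay: weight given to the running average; higher means stronger smoothing.
    """
    smoothed = [list(layer_prompts[0])]
    for current in layer_prompts[1:]:
        previous = smoothed[-1]
        smoothed.append(
            [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]
        )
    return smoothed


# An oscillating one-dimensional sequence is damped by the average.
print(ema_across_layers([[1.0], [-1.0], [1.0], [-1.0]], decay=0.5))
```

Such a filter has no learned parameters, which is what would make it a cheap point of comparison for a learned dynamical model.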
The primary strength of this research is its novel conceptualization of prompt tuning as a dynamical system, providing a principled framework to address the observed training instabilities. By applying the Koopman operator and Lyapunov stability theory, it moves beyond empirical heuristics and introduces an explicit mechanism to coordinate prompt updates across layers, directly tackling the optimization mismatch problem. Another key innovation is the Modal Pre-Alignment (MPA) strategy. This meth
Assumptions of the KLD Framework: The Koopman-Lyapunov Discrete Dynamical System (KLD) assumes that the prompt dynamics can be effectively modeled by a single, global, linear operator. This could create a representational bottleneck for complex tasks where different dynamics in shallow and deep layers might be more beneficial. The framework's success also hinges on the assumption that prompt evolution is approximately linear in the learned latent space, which has not been validated across divers
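The linearity assumption discussed in this review can be illustrated with a small sketch. It is not the paper's KLD formulation, whose details are not given here. It assumes a latent prompt state `z` that is evolved from layer to layer by one shared matrix `K`, and it uses V(z) = ||z||^2 as a simple quadratic Lyapunov candidate. All function names and the numeric values of `K` are hypothetical.

```python
def matvec(K, z):
    """Multiply matrix K (list of rows) by vector z."""
    return [sum(k * x for k, x in zip(row, z)) for row in K]


def lyapunov_energy(z):
    """Quadratic Lyapunov candidate V(z) = ||z||^2 (assumed form, with P = I)."""
    return sum(x * x for x in z)


def evolve(K, z0, num_layers):
    """Apply the same linear operator K once per layer, returning all states."""
    states = [z0]
    for _ in range(num_layers):
        states.append(matvec(K, states[-1]))
    return states


def is_energy_decreasing(states):
    """Check that V strictly decreases along the trajectory."""
    energies = [lyapunov_energy(z) for z in states]
    return all(later < earlier for earlier, later in zip(energies, energies[1:]))


# A contractive operator: every step shrinks the energy of the state.
K_stable = [[0.9, 0.1], [-0.1, 0.8]]
print(is_energy_decreasing(evolve(K_stable, [1.0, 1.0], num_layers=12)))

# An expansive direction: energy grows, so the check fails.
K_unstable = [[1.1, 0.0], [0.0, 1.0]]
print(is_energy_decreasing(evolve(K_unstable, [1.0, 0.0], num_layers=3)))
```

Because a single `K` is shared by every layer in this sketch, shallow and deep layers cannot follow different dynamics, which is the representational bottleneck the reviewer describes.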
1. The problem of unstable VPT training is well defined. The observations on shallow- and deep-layer prompts are interesting. The clear mismatch for gradient oscillations is something researchers might be interested in.
2. The paper is easy to follow, and the problem-solving is practical.
3. The ablation study is sufficient.
1. The mask-then-project idea sounds similar to projection-based (a.k.a. instance-aware) prompt tuning [1-2], where these papers use input projection directly to guide prompt training. The authors need to discuss them and clearly separate their differences.
2. The formatting of the conclusion is a little odd. Please fix it.
3. The motivation of this paper can be clearer, for example, why the authors want to discover frequency shortcuts. I understand the observations; however, their motivation
Taxonomy
Topics: Advanced Memory and Neural Computing · EEG and Brain-Computer Interfaces · Domain Adaptation and Few-Shot Learning
