Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Anh Tran, Nhat Ho

TL;DR
This paper revisits Visual Prompt Tuning, revealing its limitations in expressiveness, and introduces VAPT, a method that enhances prompt experts' adaptability, leading to significant performance gains on vision tasks with fewer parameters.
Contribution
It provides a theoretical reinterpretation of VPT using MoE structures and proposes VAPT to improve prompt experts' expressiveness and efficiency.
Findings
VAPT surpasses fully fine-tuned baselines by over 7% on VTAB-1K.
VAPT outperforms VPT with fewer parameters.
Theoretical analysis shows VAPT achieves optimal sample efficiency.
Abstract
Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on the recently established connection between Mixture of Experts (MoE) and prompt-based methods, wherein each attention head can be conceptualized as a composition of multiple MoE models, we reinterpret VPT as the introduction of new prompt experts into these MoE structures. We identify a key limitation in existing VPT frameworks: the restricted functional expressiveness of prompt experts, which remain static and thus limited in their adaptability. To address this, we propose Visual Adaptive Prompt Tuning (VAPT), a novel method that endows prompt experts with enhanced expressiveness…
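The mixture-of-experts reading described in the abstract can be illustrated with a minimal single-head sketch. It assumes the usual identification (value-projected tokens act as experts, softmax attention scores act as gates); the paper's exact formulation may differ, and every name below (`attention_row_as_moe`, `W_q`, `W_k`, `W_v`, the toy inputs) is hypothetical and not taken from the authors' code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [dot(row, x) for row in W]

def attention_row_as_moe(query_token, tokens, prompts, W_q, W_k, W_v):
    """One attention output row, written as a gate-weighted sum of experts.

    Assumption for illustration: experts are value-projected tokens and
    gates are softmax attention scores over [prompts; tokens].
    """
    q = matvec(W_q, query_token)
    scale = math.sqrt(len(q))
    # Prompt experts: built from learnable prompts only, so they are
    # constant with respect to the input image (the "static" experts).
    prompt_experts = [matvec(W_v, p) for p in prompts]
    # Token experts: functions of the input tokens.
    token_experts = [matvec(W_v, x) for x in tokens]
    experts = prompt_experts + token_experts
    keys = [matvec(W_k, z) for z in prompts + tokens]
    gates = softmax([dot(q, k) / scale for k in keys])
    dim = len(experts[0])
    out = [sum(g * e[i] for g, e in zip(gates, experts)) for i in range(dim)]
    return out, gates, prompt_experts

# Toy usage: identity projections, one prompt, two different "images".
I2 = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[0.5, -0.5]]
image_a = [[1.0, 0.0], [0.0, 1.0]]
image_b = [[2.0, 1.0], [-1.0, 3.0]]

out_a, gates_a, pe_a = attention_row_as_moe(image_a[0], image_a, prompts, I2, I2, I2)
out_b, gates_b, pe_b = attention_row_as_moe(image_b[0], image_b, prompts, I2, I2, I2)
```

In this sketch the gates change with the input while the prompt experts stay identical across `image_a` and `image_b`, which is the restricted expressiveness the abstract attributes to standard VPT. An input-adaptive variant would make `prompt_experts` a function of `tokens`; how VAPT does this is not specified in the text above.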
Peer Reviews
Decision: ICLR 2026 Poster
- The main motivation of this paper can provide a mathematically grounded analysis.
- The paper is easy to follow.
- The authors provide a variety of experiments, with results on FGVC, VTAB-1K, and supervised and self-supervised pretrained backbones, showing the robustness of the proposed method. In addition, ablation studies in the Appendix are very helpful to understand the proposed method.
- My major concern is the novelty. For adaptive visual prompt tuning, there are many visual prompt tuning works (e.g., CVPT, CoCoOp, ViaPT, V2APT) already exploring visually adaptive or instance-aware prompts. Hence, the contribution in adaptivity itself is incremental rather than fundamentally new. In addition, the MoE interpretation is also heavily motivated by Le et al., who already framed attention and prompting under MoE theory.
[CVPT] CVPT: Cross-Attention help Visual Prompt Tuning adapt vis
1. The paper offers a clear theoretical reinterpretation of VPT through the lens of MoE, providing both conceptual insight and mathematical grounding for understanding prompt tuning behavior.
2. The proposed Visual Adaptive Prompt Tuning (VAPT) effectively enhances the expressiveness of prompt experts by introducing input-dependent adaptive prompts while maintaining parameter efficiency.
3. Overall writing is clear and easy to follow,
1. Conceptually, VAPT’s “input-adaptive prompt experts” are similar to prompt-pool-based approaches [R1, R2]. These methods also condition prompt selection or generation on input features. In particular, [R2] generates tokens from visual prompts based on the input. If the authors could provide a comparison between the proposed method and existing prompt-pool-based approaches, it would strengthen the novelty of the work.
[R1] Wang, Zifeng, et al. "Learning to prompt for continual learning." CVPR 2022.
[R2] Kim
1. The experiment settings are sound with sufficient numbers of ablation studies in the Appendix.
2. The paper is easy to follow, and the motivation for introducing MoE to prompt tuning is reasonable.
1. The baselines provided in this paper are not new. More recent prompt tuning and other PEFT methods [1-4] should be included for completeness.
2. A critical problem of this paper is its novelty; [5] has proposed MoE prompt tuning as a manifold mapper, indicating that MoE design on prompt tuning can bring stronger expressivity. This work is highly related to the proposed research, although it has not been discussed.
3. Inconsistent experimental report. Table 1 includes E2VPT, while Table 2
Taxonomy
Topics: Visual and Cognitive Learning Processes · Cognitive Science and Education Research
