Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
Akash Gupta, Amos Storkey, Mirella Lapata

TL;DR
This paper introduces a meta-learning approach that distills task-relevant visual features into soft prompts, enabling large multimodal models to adapt effectively to new visual question answering tasks with minimal data and outperforming traditional in-context learning.
Contribution
It proposes a novel meta-learning method with soft prompt distillation and an attention-mapper for improved few-shot VQA adaptation in large multimodal models.
Findings
Achieves a 21.2% improvement over in-context learning.
Enhances adaptation by 7.7% over parameter-efficient fine-tuning methods.
Effective in low-data regimes with just a few gradient steps.
Abstract
Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve monotonically when increasing the number of examples. We hypothesize that this happens because the LMM is overwhelmed by extraneous information in the image embeddings that is irrelevant to the downstream task. To address this, we propose a meta-learning approach that induces few-shot capabilities in LMMs through a fixed set of soft prompts distilled from task-relevant visual features, which are adapted at test time using a small number of examples. We facilitate this distillation through an attention-mapper module that can be easily integrated with any LMM architecture and is jointly learned with soft prompts. Evaluation on the VL-ICL Bench shows that our method…
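As a rough illustration of the idea in the abstract, the sketch below shows one way a fixed set of soft prompts could attend over image embeddings to pull out task-relevant features. It is a toy built on assumptions, not the authors' implementation: the single-head cross-attention form, the residual connection, the class name `AttentionMapper`, and all sizes (8 prompts, 576 image tokens, width 64) are hypothetical choices made for illustration, and the weights are random rather than meta-learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class AttentionMapper:
    """Hypothetical single-head cross-attention module.

    The soft prompts act as queries; the image embeddings supply keys
    and values. This layout is an assumption, not taken from the paper.
    """

    def __init__(self, num_prompts, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Fixed set of soft prompts (learned jointly with the mapper in the paper).
        self.prompts = rng.normal(0.0, scale, (num_prompts, dim))
        # Query, key, and value projections (random here, untrained).
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, image_emb):
        q = self.prompts @ self.w_q          # (num_prompts, dim)
        k = image_emb @ self.w_k             # (num_tokens, dim)
        v = image_emb @ self.w_v             # (num_tokens, dim)
        scores = q @ k.T / np.sqrt(q.shape[-1])
        attn = softmax(scores, axis=-1)      # each prompt's weights over image tokens
        # Residual connection: prompts plus the visual features they attend to.
        return self.prompts + attn @ v, attn


# Usage with made-up sizes: 576 image tokens of width 64, 8 soft prompts.
rng = np.random.default_rng(1)
image_emb = rng.normal(size=(576, 64))
mapper = AttentionMapper(num_prompts=8, dim=64)
distilled, attn = mapper(image_emb)
print(distilled.shape, attn.shape)
```

The output is a small, fixed number of prompt vectors regardless of how many image tokens come in, which is the property the abstract relies on to keep extraneous image information away from the language model. The sketch covers only the forward pass; in the paper's setting the prompts and mapper would be trained jointly and then adapted on a few support examples at test time.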
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is very clear and comprehensive. Experiments are well defined and serve a clear purpose. - New approach to the problem which gives clear improvement in the results. - Good ablation study to analyze specific aspects of the methods.
- The paper indicates that existing methods cannot leverage the benefits of longer contexts. This is clearly demonstrated through experimentation, but more theoretical and intuitive grounding could be given. - The method can be applied to any existing method but is only shown for a limited number of models. It is also not made explicit what is required for a model to be able to apply your method. You are using prompts, and not every model uses exactly the same type of prompts.
The proposed method is novel and shows promising improvements on few-shot VQA. The experiments also provide several interesting insights about the role of the attention mapper, soft prompts, and various fine-tuning strategies.
1. The description of how meta tasks are constructed lacks clarity. The paper must provide a clear table describing the number and composition of meta tasks used for fine-tuning and test-time fine-tuning. Without it, it is hard to understand how the test accuracies are computed and how they can be compared.
2. Looking at the Appendix, it seems that different baselines are fine-tuned with different compositions of meta-tasks (Appendix A.1.3): e.g., MAPD has 10-10 (support-query), whereas Multi-Task has 5-5,
1. Novel and effective integration of meta-learning with multimodal prompt distillation: MAPD is the first approach to apply MAML-style bi-level optimization to distill visual features into soft prompts for LMMs. Its lightweight attention-mapper (~24M trainable parameters) enables efficient task adaptation, achieving state-of-the-art few-shot VQA performance across diverse VL-ICL tasks, while scaling reliably with support set size.
2. Rigorous experimental design with strong baselines and repro
1. The paper attributes the non-monotonic improvement in ICL performance of small-parameter LMMs to "irrelevant information interference in image embedding," but it does not rule out other contributing factors, such as the inherent limitations of small models or deviations in instruction understanding.
2. While much prior work has focused on optimizing the projection layer of LMMs, the advantages of this paper relative to existing methods appear limited. This study focuses on scenarios involving
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training
