MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li, Fan Lyu

TL;DR
MM-Prompt introduces a cross-modal prompt tuning framework for continual visual question answering, effectively balancing modality engagement and improving accuracy and knowledge retention over time.
Contribution
The paper proposes MM-Prompt, a novel approach that incorporates cross-modal prompt query and recovery to address modality imbalance in continual VQA.
Findings
Outperforms prior methods in accuracy.
Enhances knowledge retention in continual learning.
Maintains balanced modality engagement.
Abstract
Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while…
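To make the contrast between isolated and cross-modal prompt selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the query is formed, so the linear blend in `mix`, the weight `alpha`, the cosine-similarity key matching, and all feature values are assumptions chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mix(own, other, alpha):
    """Hypothetical query formation: blend one modality's feature with the other's.

    alpha = 0.0 reproduces an isolated (single-modality) query.
    """
    return [(1 - alpha) * o + alpha * c for o, c in zip(own, other)]


def select_prompts(query, keys, top_k):
    """Return indices of the top_k prompt keys most similar to the query."""
    ranked = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    return ranked[:top_k]


# Toy features and prompt keys (made up for illustration).
visual_feat = [1.0, 0.0, 0.0]
text_feat = [0.0, 1.0, 0.0]
prompt_keys = [
    [1.0, 0.0, 0.0],  # key aligned with the visual feature only
    [0.0, 1.0, 0.0],  # key aligned with the text feature only
    [1.0, 1.0, 0.0],  # key aligned with both modalities
]

# Isolated selection: the visual query ignores the question text.
isolated = select_prompts(visual_feat, prompt_keys, top_k=1)

# Cross-modal selection: the visual query is blended with the text feature.
cross = select_prompts(mix(visual_feat, text_feat, alpha=0.5), prompt_keys, top_k=1)

print("isolated:", isolated)
print("cross-modal:", cross)
```

With these toy values, the isolated visual query selects the visually aligned key (index 0), while the blended query selects the key aligned with both modalities (index 2).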
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This work proposes a plausible solution to avoid cross-modal prompt isolation.
2. This work designs a dual-objective loss for modality interactions.
3. The experimental results and ablation studies are extensive.
1. Although modality bias is mentioned frequently, it is never justified. A few sentences explaining how and why modality bias affects model performance would help.
2. Model performance is evaluated on only two datasets.
1. The approach of explicitly mitigating the dominant modality bias in pre-trained models for continual VQA is novel.
2. The authors compare their approach with a variety of previous works, which validates its strength.
3. The authors conduct an extensive ablation study validating the choice of each component used in MM-Prompt.
1. Lines 126-127 - “This reinforces alignment with modality-specific feature distributions, amplifies the dominant modality bias, and hinders the integration of complementary information.” - The authors provide no experiment showing that their pretrained model has a dominant modality bias.
2. The motivation behind developing a cross-modal prompt selection strategy is not clear. Many state-of-the-art large multimodal models, such as Qwen2-VL and LLaVA-OneVision, simply extract vision features from
- The proposed cross-modal interactions in prompt selection make sense and should improve over isolated prompts.
- Ablations of the method show the importance of the proposed new modules.
- The method obtains superior results on two commonly used datasets.
- Many of the cited prompting methods were not designed for VQA. VQA, which takes both text and visual data as input, is fundamentally different from standard classification. Indeed, Fig. 1a does not make much sense for VQA (and not many papers claim it does). The paper should better explain how it differs from other multi-modal prompting methods (like Khartak). Is this not also the reason for the very bad attention maps in Figure 5?
- Joint training results are missing as an upper bound. Th
- The overall structure is clear, and the paper presents the modality imbalance problem in a straightforward way, making it easy to understand why existing methods may fail.
- The experimental evaluation is extensive, including multiple continual learning settings (e.g., task order and memory variations) and comparisons with many prior methods, which makes the results convincing.
- The authors also provide a discussion of the method’s limitations, showing awareness of the remaining challenges
- The work addresses a relevant problem and the proposed design is effective, but the innovation lies mainly in architectural refinements within the prompt-tuning pipeline rather than a fundamental methodological or theoretical advancement.
- All cross-modal interaction happens at the prompt level while the multimodal backbone remains pre-trained. It is unclear whether MM-Prompt improves deep multimodal reasoning or mainly adjusts the injected prompt signals. Can the authors provide some additional
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ADaptive gradient method with the OPTimal convergence rate
