Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Tao Hu, Da-Wei Zhou

TL;DR
This paper introduces DRAPE, a prompt-learning framework that generates instance-specific prompts for multimodal continual instruction tuning, improving adaptability and reducing forgetting in multimodal large language models.
Contribution
DRAPE synthesizes cross-modal, instance-specific prompts using a novel query-based approach, advancing continual learning in multimodal large language models.
Findings
DRAPE achieves state-of-the-art performance on MCIT benchmarks.
It effectively mitigates catastrophic forgetting during sequential task learning.
Instance-specific prompt generation outperforms task-level prompt methods.
Abstract
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous…
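To make the contrast in the abstract concrete, the following is a minimal sketch of the two paradigms: routing to a stored task-level prompt versus generating a prompt from the individual query-image pair. It is not DRAPE's implementation. The abstract is truncated before the method details, so everything here is an assumption for illustration: the sizes `D`, `P`, `T`, the cosine-similarity router, the use of single-head cross-attention with learnable `prompt_queries`, and all function names.

```python
import math
import random

random.seed(0)
D, P, T = 8, 2, 3  # hypothetical sizes: feature dim, prompt length, tasks seen

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Task-level paradigm (assumed form): one stored prompt per task,
# selected by similarity between a pooled feature and per-task keys.
task_keys = rand_matrix(T, D)
task_prompts = [rand_matrix(P, D) for _ in range(T)]

def route_task_prompt(query_feat):
    """Return the stored prompt whose key has the highest cosine similarity."""
    def cosine(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-8)
    best = max(range(T), key=lambda t: cosine(task_keys[t], query_feat))
    return task_prompts[best]

# Instance-level paradigm (assumed form): prompt queries attend over the
# image and text tokens of this one sample, so the prompt depends on its content.
prompt_queries = rand_matrix(P, D)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate_instance_prompt(image_tokens, text_tokens):
    """Cross-attention: each prompt vector is a weighted mix of the sample's tokens."""
    tokens = image_tokens + text_tokens  # concatenate the two token lists, N x D
    prompt = []
    for q in prompt_queries:
        weights = softmax([dot(q, tok) / math.sqrt(D) for tok in tokens])
        prompt.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(D)]
        )
    return prompt
```

In this sketch, two samples from the same task would receive the same prompt from `route_task_prompt` whenever they route to the same key, while `generate_instance_prompt` returns a different P x D prompt for each distinct image-text pair.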
