Continual Instruction Tuning for Large Multimodal Models
Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang

TL;DR
This paper studies continual instruction tuning for large multimodal models (LMMs). It shows that catastrophic forgetting persists in this setting and adapts existing continual learning methods to mitigate it, so that models can keep up with evolving vision-language tasks.
Contribution
The paper establishes the first benchmark for continual instruction tuning of LMMs, analyzes forgetting dynamics, and adapts classic continual learning methods to this setting.
Findings
Catastrophic forgetting persists in continual instruction tuning of LMMs.
Multi-task joint instruction tuning helps mitigate forgetting.
Data replay and model expansion strategies are effective in this context.
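As an illustration of the data replay idea named in the findings, the sketch below builds the training set for each stage of a task sequence by mixing the new task's instruction data with a small sample kept from every earlier task. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_replay_stages`, the `replay_per_task` parameter, and the record fields are all hypothetical, and the actual selection strategy used in the paper may differ.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical instruction-format record, e.g.
# {"task": "vqa", "instruction": "...", "response": "..."}
Example = Dict[str, str]


def build_replay_stages(
    tasks: Sequence[Sequence[Example]],
    replay_per_task: int,
    seed: int = 0,
) -> List[List[Example]]:
    """Return one training set per stage of a sequential task stream.

    Stage i contains all data of task i plus up to `replay_per_task`
    examples sampled once from each earlier task (assumed strategy:
    uniform random sampling, fixed when the task is first seen).
    """
    rng = random.Random(seed)
    memory: List[List[Example]] = []  # one small buffer per finished task
    stages: List[List[Example]] = []

    for task_data in tasks:
        current = list(task_data)
        # Mix the new task with everything stored from earlier tasks.
        replayed = [ex for buffer in memory for ex in buffer]
        stage = current + replayed
        rng.shuffle(stage)
        stages.append(stage)

        # Keep a small sample of the task just seen for future stages.
        k = min(replay_per_task, len(current))
        memory.append(rng.sample(current, k))

    return stages
```

With three tasks of five examples each and `replay_per_task=2`, the stages hold five, seven, and nine examples respectively, so the replay cost grows linearly with the number of earlier tasks rather than with their full size.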
Abstract
Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
