Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, Christopher Kanan

TL;DR
This paper addresses the challenge of integrating vision models into large language models by applying continual learning techniques to improve visual understanding without sacrificing language performance.
Contribution
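It treats the integration of a vision model into an LLM as a continual learning problem, evaluates five continual learning methods on the LLaVA MLLM, and identifies a technique that enhances visual understanding while preserving linguistic skills better than the standard LLaVA recipe.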
Findings
Reduces linguistic performance degradation by up to 15% compared with the LLaVA recipe.
Maintains high accuracy on multimodal tasks.
Demonstrates robustness across multiple vision-language tasks.
Abstract
Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks,…
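To make "linguistic performance degradation" concrete, the following is a minimal sketch of one way such forgetting could be quantified: the mean accuracy drop on language-only benchmarks between the original LLM and the resulting MLLM. The function name, task names, and scores are hypothetical illustrations; they are not the paper's metric definition or its reported results.

```python
def linguistic_degradation(base_scores, mllm_scores):
    """Mean accuracy drop (base LLM minus MLLM) over shared language-only tasks.

    Both arguments map a task name to an accuracy in percent.
    A positive result means the MLLM performs worse than the base LLM.
    """
    tasks = sorted(set(base_scores) & set(mllm_scores))
    if not tasks:
        raise ValueError("no shared tasks to compare")
    return sum(base_scores[t] - mllm_scores[t] for t in tasks) / len(tasks)


# Hypothetical accuracies (percent); illustrative only, not from the paper.
base = {"task_a": 70.0, "task_b": 60.0}        # original LLM
naive = {"task_a": 60.0, "task_b": 50.0}       # MLLM trained without mitigation
mitigated = {"task_a": 68.0, "task_b": 58.0}   # MLLM trained with a mitigation method

drop_naive = linguistic_degradation(base, naive)
drop_mitigated = linguistic_degradation(base, mitigated)
print(f"drop without mitigation: {drop_naive:.1f} points")
print(f"drop with mitigation:    {drop_mitigated:.1f} points")
```

Comparing the two drops shows how much forgetting a mitigation method avoids under this assumed definition; a real evaluation would substitute measured benchmark scores.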
Taxonomy
Topics: Speech Recognition and Synthesis
