Improving Multimodal Large Language Models Using Continual Learning

Shikhar Srivastava; Md Yousuf Harun; Robik Shrestha; Christopher Kanan

arXiv:2410.19925·cs.CL·August 14, 2025

Open Access

TL;DR

Integrating a pre-trained vision model into a large language model often degrades its language performance. This paper treats that integration as a continual learning problem and identifies a technique that improves visual understanding while limiting the loss in linguistic performance.

Contribution

Using the LLaVA MLLM, it evaluates five continual learning methods for mitigating forgetting and identifies one that enhances visual capabilities while preserving linguistic skills better than the LLaVA recipe.

Findings

01. Reduces linguistic performance degradation by up to 15% over the LLaVA recipe.

02. Maintains high accuracy on multimodal tasks.

03. Demonstrates robustness under continual learning on a sequence of vision-language tasks.

Abstract

Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks,…
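To illustrate how the linguistic forgetting described in the abstract might be quantified, here is a minimal sketch that compares a base LLM with an MLLM on shared language tasks. The function name, task names, and all scores are hypothetical placeholders, not results or metrics from the paper, whose own evaluation protocol may differ.

```python
from statistics import mean


def linguistic_degradation(base_scores, mllm_scores):
    """Mean score drop from the original LLM to the MLLM over shared tasks.

    Positive values mean the MLLM lost linguistic ability relative to the base LLM.
    This metric is an illustrative assumption, not the paper's definition.
    """
    shared = sorted(base_scores.keys() & mllm_scores.keys())
    if not shared:
        raise ValueError("no tasks in common")
    return mean(base_scores[task] - mllm_scores[task] for task in shared)


# Hypothetical accuracies (percent) on two made-up language tasks.
base = {"task_a": 70.0, "task_b": 60.0}       # original LLM
naive = {"task_a": 58.0, "task_b": 52.0}      # MLLM trained without mitigation
mitigated = {"task_a": 67.0, "task_b": 59.0}  # MLLM trained with a mitigation method

naive_drop = linguistic_degradation(base, naive)
mitigated_drop = linguistic_degradation(base, mitigated)
print(f"degradation without mitigation: {naive_drop:.1f} points")
print(f"degradation with mitigation:    {mitigated_drop:.1f} points")
```

Under this assumed metric, a mitigation method is judged by how far it shrinks the drop while multimodal accuracy, measured separately, stays high.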

Taxonomy

Topics: Speech Recognition and Synthesis