Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention

Hamza Ahmed Durrani; Rafay Suleman Durrani

arXiv:2605.12789 · cs.RO · May 14, 2026

TL;DR

This paper introduces a continual learning framework for large vision-language models that reduces forgetting and preserves cross-modal alignment during sequential task learning, at about 15% additional computational cost.

Contribution

The framework combines enhanced Elastic Weight Consolidation (EWC) with multi-modal regularization techniques to improve lifelong learning in vision-language models.

Findings

01. 78% reduction in forgetting rates compared to naive methods

02. Preserves cross-modal alignment with only 15% extra computational cost

03. Effective in dynamic real-world environments for autonomous and robotic systems

Abstract

Large vision-language models (LVLMs) such as CLIP, Flamingo, and BLIP have revolutionized AI by enabling understanding across textual and visual modalities. These models excel at tasks like image captioning, visual question answering, and cross-modal retrieval. However, they suffer from catastrophic forgetting when learning new tasks sequentially, a problem that is particularly challenging in multi-modal settings, where preserving cross-modal alignment adds complexity to the learning process. This paper presents a comprehensive continual learning framework for LVLMs that combines enhanced Elastic Weight Consolidation (EWC) with parameter-efficient fine-tuning techniques. We integrate multi-modal Fisher Information Matrix calculation, consistency preservation across modalities, and adaptive regularization that considers dependencies across visual and textual encoders. The framework achieves a 78% reduction in…
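
For readers unfamiliar with EWC, the following is a minimal sketch of the standard diagonal-Fisher EWC penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2. It is not the enhanced multi-modal variant the paper proposes, whose details are not given in this abstract. The parameter group names `vision` and `text`, the helper names `diagonal_fisher` and `ewc_penalty`, and the toy numbers are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical representation: parameter group name -> flat list of values.
Params = Dict[str, List[float]]


def diagonal_fisher(per_sample_grads: List[Params]) -> Params:
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    first = per_sample_grads[0]
    return {
        name: [
            sum(g[name][i] ** 2 for g in per_sample_grads) / n
            for i in range(len(first[name]))
        ]
        for name in first
    }


def ewc_penalty(params: Params, anchor: Params, fisher: Params, lam: float) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - anchor_i)^2."""
    total = 0.0
    for name, values in params.items():
        for value, old, weight in zip(values, anchor[name], fisher[name]):
            total += weight * (value - old) ** 2
    return 0.5 * lam * total


# Toy example: two per-sample gradients recorded on a previous task.
grads = [
    {"vision": [1.0, 0.0], "text": [2.0]},
    {"vision": [3.0, 0.0], "text": [0.0]},
]
fisher = diagonal_fisher(grads)

anchor = {"vision": [0.5, 0.5], "text": [1.0]}    # weights after the old task
current = {"vision": [1.5, 2.5], "text": [0.0]}   # weights during the new task
penalty = ewc_penalty(current, anchor, fisher, lam=2.0)
```

In this toy setup the second `vision` weight has zero Fisher value, so it can move freely without adding to the penalty, while drift in the high-Fisher weights is penalized. In training, the penalty would be added to the new task's loss.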
