Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention
Hamza Ahmed Durrani, Rafay Suleman Durrani

TL;DR
This paper introduces a continual learning framework for large vision-language models that reduces forgetting and preserves cross-modal alignment during sequential task learning, at a reported 15% additional computational cost.
Contribution
It combines enhanced Elastic Weight Consolidation with multi-modal regularization techniques to improve lifelong learning in vision-language models.
Findings
78% reduction in forgetting rates compared to naive methods
Preserves cross-modal alignment with only 15% extra computational cost
Effective in dynamic real-world environments for autonomous and robotic systems
Abstract
Large vision-language models (LVLMs) such as CLIP, Flamingo, and BLIP have revolutionized AI by enabling understanding across textual and visual modalities. These models excel at tasks like image captioning, visual question answering, and cross-modal retrieval. However, they suffer from catastrophic forgetting when learning new tasks sequentially. The problem is particularly challenging in multi-modal settings, where cross-modal alignments must also be preserved. This paper presents a comprehensive continual learning framework for LVLMs that combines enhanced Elastic Weight Consolidation (EWC) with parameter-efficient fine-tuning techniques. We integrate multi-modal Fisher Information Matrix calculation, consistency preservation across modalities, and adaptive regularization that considers dependencies across visual and textual encoders. The framework achieves a 78% reduction in…
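As background for the EWC component named in the abstract, the sketch below shows the standard EWC quadratic penalty with a diagonal Fisher estimate, written in plain Python. It is not the paper's enhanced method: the per-group weights in `lam` (one for a "visual" group, one for a "text" group) are an assumption made here to illustrate how a penalty could treat the two encoders differently, and the names `diagonal_fisher`, `ewc_penalty`, `Params`, and the group keys are all hypothetical.

```python
from typing import Dict, List

# Hypothetical container: parameter-group name -> flat list of values.
Params = Dict[str, List[float]]


def diagonal_fisher(per_sample_grads: List[Params]) -> Params:
    """Estimate a diagonal Fisher matrix as the mean of squared
    per-sample gradients (the usual empirical approximation)."""
    n = len(per_sample_grads)
    fisher: Params = {}
    for name, first in per_sample_grads[0].items():
        fisher[name] = [
            sum(g[name][i] ** 2 for g in per_sample_grads) / n
            for i in range(len(first))
        ]
    return fisher


def ewc_penalty(params: Params, anchor: Params, fisher: Params,
                lam: Dict[str, float]) -> float:
    """Standard EWC penalty: sum over parameters of
    0.5 * lambda * F_i * (theta_i - theta_star_i)^2.

    Assumption (not from the paper): lambda is looked up per
    parameter group, so visual and text encoders can be weighted
    differently.
    """
    total = 0.0
    for name, values in params.items():
        for p, a, f in zip(values, anchor[name], fisher[name]):
            total += 0.5 * lam[name] * f * (p - a) ** 2
    return total


# Toy usage with made-up numbers: two samples of gradients from an old task.
grads = [
    {"visual": [1.0, 2.0], "text": [0.0]},
    {"visual": [3.0, 0.0], "text": [2.0]},
]
fisher = diagonal_fisher(grads)
anchor = {"visual": [0.0, 0.0], "text": [0.0]}   # weights after the old task
current = {"visual": [1.0, 1.0], "text": [1.0]}  # weights during the new task
penalty = ewc_penalty(current, anchor, fisher, {"visual": 2.0, "text": 1.0})
```

During training on a new task, a penalty of this form would be added to the new task's loss, so parameters with large Fisher values (those the old task was most sensitive to) are pulled back toward their anchored values more strongly.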
