AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu

TL;DR
This paper introduces AIM, a method to improve continual learning in vision-language models for visual question answering by addressing asymmetric information interference, leading to better knowledge retention and generalization.
Contribution
AIM applies modality-specific masking to balance stability and plasticity in asymmetric VQA models, achieving state-of-the-art continual learning performance.
Findings
AIM outperforms existing methods in Average Performance and Forgetting metrics.
AIM better preserves generalization to new skill-concept combinations.
Both results are reported in experiments on VQA v2 and GQA.
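The summary does not define the two metrics named above. As a rough sketch, assuming the paper follows the definitions commonly used in continual learning (an assumption, not something confirmed here), both can be computed from an accuracy matrix in which `acc[i][j]` is the accuracy on task `j` after training on task `i`. The function names and the numbers in the example are hypothetical.

```python
def average_performance(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[i][j] is assumed to hold the accuracy on task j measured
    after training on task i (tasks are learned in order 0..T-1).
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy.

    The last task is excluded, since it cannot have been forgotten yet.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any point before the final training stage.
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracy matrix for three sequential tasks (illustrative values only).
example_acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.90, 0.00],
    [0.60, 0.80, 0.85],
]
print(average_performance(example_acc))
print(average_forgetting(example_acc))
```

Under these definitions, a higher Average Performance and a lower Forgetting value are better.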
Abstract
In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity.…
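The abstract is truncated before it describes how the masks are built, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that each trainable group (here a hypothetical `visual_projector` and `language_decoder`) has a per-parameter sensitivity score, and that each group freezes its own most sensitive fraction of parameters. The scores, ratios, and function names are all assumptions made for illustration.

```python
def build_mask(sensitivity, protect_ratio):
    """Return a 0/1 mask for one parameter group.

    0.0 freezes a parameter (stability); 1.0 leaves it trainable (plasticity).
    The protect_ratio most sensitive entries of this group are frozen.
    """
    n_protect = int(round(protect_ratio * len(sensitivity)))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    protected = set(ranked[:n_protect])
    return [0.0 if i in protected else 1.0 for i in range(len(sensitivity))]


def masked_sgd_step(params, grads, masks, lr):
    """Plain SGD update in which masked (frozen) entries receive no update."""
    return {
        name: [p - lr * m * g
               for p, g, m in zip(params[name], grads[name], masks[name])]
        for name in params
    }


# Hypothetical sensitivity scores; the decoder group is larger than the projector.
sensitivity = {
    "visual_projector": [0.9, 0.1, 0.5, 0.2],
    "language_decoder": [0.3, 0.8, 0.05, 0.6, 0.1, 0.4, 0.2, 0.7],
}
# Assumed per-modality budgets: ranking happens inside each group, so the
# small projector is protected regardless of how large the decoder is.
protect_ratio = {"visual_projector": 0.5, "language_decoder": 0.25}

masks = {name: build_mask(scores, protect_ratio[name])
         for name, scores in sensitivity.items()}
print(masks)
```

Ranking sensitivities within each group, rather than across all parameters at once, is one simple way to keep a much larger component from dominating the protection budget, which is the imbalance the abstract attributes to global regularization.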
