Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion
Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

TL;DR
This paper proposes a multimodal class-incremental learning (MCIL) approach that builds on pre-trained models across vision, audio, and text. It combines an incremental feature extractor, an adaptive fusion module, and a contrastive training loss, and is validated on three multimodal datasets.
Contribution
It presents a new MCIL framework with a multimodal incremental feature extractor, adaptive fusion module, and class-incremental contrastive loss, addressing multimodal integration and catastrophic forgetting.
Findings
Effective incremental learning across three modalities.
Improved cross-modal feature alignment and discrimination.
Validated on three diverse multimodal datasets.
Abstract
Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information and mitigating catastrophic forgetting. To tackle these issues, we propose an MCIL method based on multimodal pre-trained models. Firstly, a Multimodal Incremental Feature Extractor (MIFE) based on Mixture-of-Experts (MoE) structure is introduced to achieve effective incremental fine-tuning for AudioCLIP. Secondly, to enhance feature discriminability and generalization, we propose an Adaptive Audio-Visual Fusion Module (AAVFM) that includes a masking threshold mechanism and a dynamic feature fusion mechanism, along with a strategy to enhance text diversity. Thirdly, a novel multimodal class-incremental contrastive training loss is proposed to optimize…
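The abstract names a masking threshold mechanism and a dynamic feature fusion mechanism but does not say how either is computed. The sketch below is one hypothetical reading, not the authors' AAVFM: it assumes audio-visual agreement is measured by cosine similarity, that audio is masked out when agreement falls below a threshold, and that the audio weight otherwise scales with agreement. The function name `adaptive_fuse`, the parameter `mask_threshold`, and the 0.5 scaling factor are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def adaptive_fuse(visual, audio, mask_threshold=0.2):
    """Hypothetical fusion rule (an assumption, not the paper's module).

    - Masking threshold: if audio and visual features disagree
      (cosine similarity below mask_threshold), the audio is ignored.
    - Dynamic fusion: otherwise the audio weight grows with agreement,
      capped at 0.5 so the visual feature is never outweighed.
    Returns the unit-length fused feature and the audio weight used.
    """
    v = l2_normalize(visual)
    a = l2_normalize(audio)
    agreement = min(1.0, cosine(v, a))
    audio_weight = 0.0 if agreement < mask_threshold else 0.5 * agreement
    fused = [(1.0 - audio_weight) * x + audio_weight * y for x, y in zip(v, a)]
    return l2_normalize(fused), audio_weight


if __name__ == "__main__":
    # Orthogonal features: the audio is masked and the visual feature passes through.
    print(adaptive_fuse([1.0, 0.0], [0.0, 1.0]))
    # Partially aligned features: the audio contributes in proportion to agreement.
    print(adaptive_fuse([1.0, 0.0], [1.0, 1.0]))
```

In a real system the gate would typically be learned rather than fixed, and the fused feature would then be matched against text embeddings. The hard-coded rule here only illustrates how a threshold and an agreement-dependent weight can interact.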
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
