Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

Yukun Chen; Zihuan Qiu; Fanman Meng; Hongliang Li; Linfeng Xu; Qingbo Wu

arXiv:2506.09999·cs.LG·June 13, 2025

TL;DR

This paper introduces a multimodal class-incremental learning (MCIL) approach that builds on pre-trained models across vision, audio, and text. It combines an incremental feature extractor, an adaptive fusion module, and a contrastive training loss, and is validated on three multimodal datasets.

Contribution

The paper presents an MCIL framework comprising a multimodal incremental feature extractor, an adaptive fusion module, and a class-incremental contrastive loss, aimed at integrating complementary multimodal information and mitigating catastrophic forgetting.

Findings

01. Effective incremental learning across three modalities.

02. Improved cross-modal feature alignment and discrimination.

03. Validated on three diverse multimodal datasets.

Abstract

Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio, and text modalities, addressing the challenges of integrating complementary information and mitigating catastrophic forgetting. To tackle these issues, we propose an MCIL method based on multimodal pre-trained models. First, a Multimodal Incremental Feature Extractor (MIFE) based on a Mixture-of-Experts (MoE) structure is introduced to achieve effective incremental fine-tuning of AudioCLIP. Second, to enhance feature discriminability and generalization, we propose an Adaptive Audio-Visual Fusion Module (AAVFM) that includes a masking threshold mechanism and a dynamic feature fusion mechanism, along with a strategy to enhance text diversity. Third, a novel multimodal class-incremental contrastive training loss is proposed to optimize…
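
As a rough illustration of what a masking threshold combined with dynamic feature fusion could look like, the sketch below fuses one audio and one visual feature vector. It is not the paper's AAVFM, whose details are not given in this excerpt. The function name `adaptive_fuse`, the per-modality scores, the `threshold` value, and the softmax weighting are all hypothetical choices made for the example.

```python
import math


def adaptive_fuse(audio, visual, audio_score, visual_score, threshold=0.2):
    """Fuse two equal-length feature vectors (illustrative only).

    Assumed rule, not taken from the paper: a modality whose score is
    below `threshold` is masked out, and the remaining modalities are
    combined with softmax weights computed from their scores.
    """
    if len(audio) != len(visual):
        raise ValueError("feature vectors must have the same length")

    feats = {"audio": audio, "visual": visual}
    scores = {"audio": audio_score, "visual": visual_score}

    # Masking step: keep only modalities that reach the threshold.
    kept = [name for name in scores if scores[name] >= threshold]
    if not kept:
        # Fall back to the single best modality so the output is never empty.
        kept = [max(scores, key=scores.get)]

    # Dynamic weighting step: numerically stable softmax over kept scores.
    top = max(scores[name] for name in kept)
    exps = {name: math.exp(scores[name] - top) for name in kept}
    total = sum(exps.values())
    weights = {name: exps[name] / total for name in kept}

    fused = [
        sum(weights[name] * feats[name][i] for name in kept)
        for i in range(len(audio))
    ]
    return fused, weights
```

With equal scores the two modalities contribute equally. If one score drops below the threshold, the fused vector reduces to the other modality's features. In a trained model the scores would presumably be learned rather than passed in by hand.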

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Multimodal Machine Learning Applications