TL;DR
This paper introduces LLaVA-DyMoE, a dynamic mixture-of-experts (MoE) framework with drift-aware token assignment for continual learning in large vision-language models. It reduces forgetting and improves accuracy.
Contribution
It proposes token-level assignment guidance and regularization techniques that mitigate routing drift in MoE-based continual learning and thereby reduce forgetting.
Findings
Achieves over 7% gain in mean final accuracy.
Reduces forgetting by 12% compared to baselines.
Effectively mitigates routing-drift-induced forgetting.
Abstract
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that…
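The routing-drift failure mode described in the abstract can be illustrated with a toy top-1 router. This is only a minimal sketch under stated assumptions, not the paper's method: the gate weights, the token vectors, and the helper names (`route_top1`, `add_expert`, `drift_rate`) are all hypothetical and chosen so the effect is easy to see. Old gate rows stay frozen, one new row is appended, and the sketch counts how many old-task tokens now select the new expert.

```python
# Toy illustration of routing drift in an expanding MoE router.
# All weights, tokens, and function names are hypothetical (not from the paper).

def route_top1(token, gate_rows):
    """Return the index of the expert whose gate row scores the token highest."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_rows]
    return max(range(len(scores)), key=scores.__getitem__)

def add_expert(gate_rows, new_row):
    """Expand the router by one row; existing rows are copied unchanged (frozen)."""
    return [list(row) for row in gate_rows] + [list(new_row)]

def drift_rate(tokens, gate_rows, new_expert_idx):
    """Fraction of tokens whose top-1 expert is the newly added expert."""
    hits = sum(1 for t in tokens if route_top1(t, gate_rows) == new_expert_idx)
    return hits / len(tokens)

# Router after the old task: two experts, one gate row each.
old_gate = [[1.0, 0.0], [0.0, 1.0]]

# Old-task tokens: the first two are clearly owned by one expert,
# the last two are ambiguous (nearly equal scores for both old experts).
old_tokens = [[2.0, 0.1], [0.1, 2.0], [1.0, 0.9], [0.9, 1.0]]

# Assumed gate row for the new expert, as if learned on new-task data.
new_row = [0.8, 0.8]
expanded_gate = add_expert(old_gate, new_row)

drift_before = drift_rate(old_tokens, old_gate, new_expert_idx=2)
drift_after = drift_rate(old_tokens, expanded_gate, new_expert_idx=2)
print(drift_before, drift_after)
```

In this toy setup the old experts and their gate rows are untouched, yet the two ambiguous old-task tokens are rerouted to the new expert, which is the kind of drift the abstract attributes forgetting to.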
