TL;DR
StrLoRA introduces a novel expert routing framework for streaming continual visual instruction tuning, enabling multimodal models to learn from dynamic, interleaved data streams while mitigating forgetting.
Contribution
The paper proposes StrLoRA, a regularized two-stage expert routing method that improves continual learning for multimodal models in a realistic streaming setting.
Findings
StrLoRA outperforms existing methods on the StrCVIT benchmark.
It effectively distinguishes and adapts to heterogeneous task samples.
It helps the model acquire new abilities and reinforce recurring ones as the data stream evolves, while mitigating forgetting.
Abstract
Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage…
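The streaming setting described in the abstract can be illustrated with a small simulation. This is only a sketch of the general idea (chunks with a changing task mixture), not the paper's benchmark construction: the task names (`vqa`, `captioning`, `ocr`), the mixing weights, and the helpers `make_stream` and `classify_tasks` are all hypothetical and chosen for illustration.

```python
import random


def make_stream(task_schedule, chunk_size, seed=0):
    """Yield one chunk per schedule entry.

    Each chunk is a list of (task_name, sample_id) pairs drawn from the
    task mixture given for that chunk. Purely illustrative; real samples
    would be image-instruction pairs rather than string ids.
    """
    rng = random.Random(seed)
    for chunk_idx, weights in enumerate(task_schedule):
        tasks = list(weights)
        probs = [weights[t] for t in tasks]
        labels = rng.choices(tasks, weights=probs, k=chunk_size)
        yield [(task, f"{task}-{chunk_idx}-{i}") for i, task in enumerate(labels)]


def classify_tasks(task_schedule):
    """For each chunk, report which tasks are new, recurring, or absent.

    'absent' tasks were seen in an earlier chunk but are missing from the
    current one, so they are the ones at risk of being forgotten.
    """
    seen = set()
    report = []
    for weights in task_schedule:
        present = {t for t, w in weights.items() if w > 0}
        report.append({
            "new": sorted(present - seen),
            "recurring": sorted(present & seen),
            "absent": sorted(seen - present),
        })
        seen |= present
    return report


# Hypothetical schedule: the task mixture changes from chunk to chunk.
schedule = [
    {"vqa": 1.0},
    {"vqa": 0.5, "captioning": 0.5},
    {"captioning": 0.7, "ocr": 0.3},
    {"vqa": 0.2, "ocr": 0.8},
]

chunks = list(make_stream(schedule, chunk_size=8))
report = classify_tasks(schedule)
```

In this toy schedule, the third chunk introduces `ocr`, repeats `captioning`, and omits `vqa`, so a learner would have to do all three things the abstract lists at once: acquire a new ability, reinforce a recurring one, and avoid forgetting an absent one. A task-incremental setup, by contrast, would correspond to a schedule in which every chunk contains exactly one task.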
