StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Chang Che; Ziqi Wang; Hui Ma; Cheems Wang; Zenglin Shi

arXiv:2605.16353 · cs.CV · May 20, 2026

TL;DR

StrLoRA introduces a novel expert routing framework for streaming continual visual instruction tuning, enabling multimodal models to learn from dynamic, interleaved data streams while mitigating forgetting.

Contribution

The paper proposes StrLoRA, a regularized two-stage expert routing method that improves continual learning for multimodal models in a realistic streaming setting.

Findings

01 StrLoRA outperforms existing methods on the StrCVIT benchmark.

02 It distinguishes and adapts to the heterogeneous task samples within each data chunk.

03 It improves the model's abilities as the data stream evolves.

Abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage…
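The streaming setting described in the abstract can be illustrated with a toy sketch. This is not the authors' benchmark or code: the task names (`vqa`, `ocr`, `grounding`), the mixture weights, the chunk size, and the helper functions `make_stream` and `categorize` are all hypothetical, chosen only to show how each chunk in a stream can mix new, recurring, and absent tasks. The sketch also uses explicit task labels for bookkeeping, which a model in this setting may not have access to.

```python
import random


def make_stream(task_schedule, chunk_size, seed=0):
    """Yield data chunks; each chunk is a list of (task, sample_id) pairs.

    task_schedule is a list of dicts, one per chunk, mapping a
    (hypothetical) task name to its mixture weight in that chunk.
    """
    rng = random.Random(seed)
    for step, mixture in enumerate(task_schedule):
        tasks = list(mixture)
        weights = [mixture[t] for t in tasks]
        # Sample an interleaved sequence of task labels for this chunk.
        labels = rng.choices(tasks, weights=weights, k=chunk_size)
        yield [(task, f"{task}-{step}-{i}") for i, task in enumerate(labels)]


def categorize(chunk, seen):
    """Split tasks into new, recurring, and absent relative to earlier chunks."""
    present = {task for task, _ in chunk}
    return {
        "new": sorted(present - seen),        # abilities to acquire
        "recurring": sorted(present & seen),  # abilities to reinforce
        "absent": sorted(seen - present),     # abilities at risk of forgetting
    }


# Hypothetical schedule: the task mixture changes from chunk to chunk.
schedule = [
    {"vqa": 1.0},
    {"vqa": 0.5, "ocr": 0.5},
    {"ocr": 0.3, "grounding": 0.7},
]

seen = set()
report = []
for chunk in make_stream(schedule, chunk_size=200):
    report.append(categorize(chunk, seen))
    seen |= {task for task, _ in chunk}

for step, roles in enumerate(report):
    print(step, roles)
```

In the third chunk of this toy schedule, `grounding` is new, `ocr` recurs, and `vqa` is absent, so a learner would need to handle all three pressures named in the abstract at once.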

Code & Models

Repositories

chanceche/StrCVIT (GitHub)
