MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis

Haiyun Guo; Zhiyan Hou; Yandu Sun; Jinghan He; Yu Chen; Yuzhe Zhou; Yuheng Jia; Jinqiao Wang; and Tat-Seng Chua

arXiv:2508.08275·cs.CL·February 16, 2026

TL;DR

This paper introduces MLLM-CTBench, a comprehensive benchmark for evaluating continual instruction tuning of multimodal large language models, emphasizing reasoning process diagnosis and comparing various algorithms and training methods.

Contribution

It establishes a multidimensional evaluation framework, conducts large-scale assessments of continual learning algorithms, and explores reinforcement fine-tuning with KL-divergence control.

Findings

01

Process reasoning is more resilient to forgetting than answer accuracy.

02

Stronger models better resist catastrophic forgetting.

03

Reinforcement fine-tuning with KL control stabilizes cross-task retention.
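As a rough illustration of what "KL control" can mean in reinforcement fine-tuning, the sketch below subtracts a KL penalty from a task reward, so the tuned policy is discouraged from drifting far from a reference model. This is a generic, commonly used formulation and not necessarily the one the paper adopts; the function names, the `beta` coefficient, and the toy distributions are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def kl_penalized_reward(task_reward, policy_probs, reference_probs, beta=0.1):
    """Task reward minus a KL penalty (hypothetical helper, beta is an assumed coefficient)."""
    return task_reward - beta * kl_divergence(policy_probs, reference_probs)

# Toy next-token distributions (made-up numbers, for illustration only).
reference = [0.5, 0.5]
unchanged_policy = [0.5, 0.5]
drifted_policy = [0.9, 0.1]

reward_unchanged = kl_penalized_reward(1.0, unchanged_policy, reference)
reward_drifted = kl_penalized_reward(1.0, drifted_policy, reference)
```

With the same task reward, the drifted policy receives a lower penalized reward than the unchanged one, which is the intuition behind using a KL term to limit forgetting of earlier tasks.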

Abstract

Continual instruction tuning (CIT) during the post-training phase is crucial for adapting multimodal large language models (MLLMs) to evolving real-world demands. However, progress is hampered by the lack of benchmarks with rigorous, protocol-consistent evaluation. To bridge this gap, we introduce MLLM-CTBench, a comprehensive benchmark for CIT of MLLMs, covering seven challenging tasks across six diverse domains. MLLM-CTBench makes three key contributions. First, we establish a multidimensional evaluation framework that jointly assesses final-answer accuracy and process-level reasoning quality, where Chain-of-Thought (CoT) traces serve as an observable signal to diagnose catastrophic forgetting beyond answer-only evaluation. Second, we conduct a large-scale evaluation of continual learning methods by systematically assessing eight representative algorithms from four major families…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling