Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

Xinglong Yang; Quan Feng; Zhongying Pan; Xiang Chen; Yu Tian; Wentong Li; Shuofei Qiao; Yuxia Geng; Xingyu Zhao; Sheng-Jun Huang

arXiv:2508.18673·cs.CL·October 14, 2025

TL;DR

This paper introduces a curriculum-based prompt selection method for Multimodal Chain-of-Thought prompting, which dynamically balances difficulty based on model perception and intrinsic complexity, leading to improved reasoning performance.

Contribution

It proposes a novel difficulty-balanced sampling strategy for prompt curriculum design that considers both model-perceived difficulty and intrinsic sample complexity.

Findings

1. Significant performance improvements across five benchmarks.
2. Reduces variability caused by random prompt sampling.
3. Enhances robustness of multimodal reasoning models.

Abstract

The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well-ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty…
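The two signals described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `disagreement` and `select_balanced`, the pool fields `answers` and `complexity`, and the rule of preferring higher-complexity examples within each bucket are all assumptions made for illustration. The 0.5 threshold and equal per-bucket quotas echo settings mentioned in the reviews. Here, model-perceived difficulty is approximated as the fraction of sampled answers that disagree with the majority answer, and the complexity score is assumed to be precomputed by some external scorer.

```python
from collections import Counter


def disagreement(answers):
    """Fraction of sampled answers that differ from the majority answer."""
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(answers)


def select_balanced(pool, k, threshold=0.5):
    """Pick up to k exemplars, split evenly between easy and hard buckets.

    pool: list of dicts with hypothetical keys
      "id"         - example identifier
      "answers"    - answers sampled from the model for this example
      "complexity" - precomputed intrinsic complexity score (assumed given)
    """
    easy, hard = [], []
    for ex in pool:
        # Bucket by model-perceived difficulty (assumed threshold rule).
        bucket = hard if disagreement(ex["answers"]) >= threshold else easy
        bucket.append(ex)

    # Assumption: within each bucket, prefer higher intrinsic complexity.
    easy.sort(key=lambda ex: ex["complexity"], reverse=True)
    hard.sort(key=lambda ex: ex["complexity"], reverse=True)

    n_hard = min(k // 2, len(hard))
    n_easy = min(k - n_hard, len(easy))
    # Assumption: order the prompt from easy exemplars to hard ones.
    return easy[:n_easy] + hard[:n_hard]


pool = [
    {"id": "a", "answers": ["A", "A", "A", "A"], "complexity": 0.2},
    {"id": "b", "answers": ["A", "A", "A", "B"], "complexity": 0.9},
    {"id": "c", "answers": ["A", "B", "C", "D"], "complexity": 0.5},
    {"id": "d", "answers": ["A", "B", "B", "C"], "complexity": 0.8},
]
selected_ids = [ex["id"] for ex in select_balanced(pool, k=2)]
print(selected_ids)
```

With this toy pool, examples `a` and `b` fall in the easy bucket and `c` and `d` in the hard bucket, so a budget of two exemplars draws one from each. How the actual method combines the two signals may differ from this bucketing rule.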

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

1. This paper targets a practical issue: how to construct prompt demonstrations for multimodal CoT reasoning.
2. The experiments cover multiple VQA benchmarks and several multimodal models, showing the effectiveness of the proposed CAMS framework.
3. The overall pipeline is relatively easy to follow and easy to implement.

Weaknesses

1. The core contribution feels like a combination of existing ideas in LLMs rather than a new concept for multimodality. The improvement over simple baselines is often modest. The proposed method also introduces additional computational cost in prompt selection.
2. The proposed complexity scorer is trained on text-only data using an outdated LLM (Llama-2) and applied to caption-plus-question inputs for multimodal tasks, but this paper does not offer convincing evidence that this score meaningfully

Reviewer 02·Rating 2·Confidence 3

Strengths

- The paper uniquely reframes MCoT prompt selection as a curriculum design problem. It innovatively combines two complementary signals (model-specific uncertainty and model-agnostic sample complexity) to create a more principled and effective selection strategy.
- The empirical validation is robust, featuring extensive experiments on five benchmarks with three different MLLMs. The ablation studies are a key strength, clearly demonstrating the synergistic benefit of the two signals, and the work p

Weaknesses

- A core component of the proposed framework is the "Complexity Scorer," yet its development and impact are not fully interrogated. The scorer is trained on a dataset created via an "evolution-based metric" that relies on ChatGPT for ranking. This introduces several issues: (1) It creates a dependency on a powerful, proprietary model, which complicates reproducibility. (2) The performance of CAMS becomes implicitly dependent on the quality of this external LLM's judgments. The paper lacks a sens

Reviewer 03·Rating 6·Confidence 3

Strengths

1. Clear framing of prompt selection as a tailored curriculum with two orthogonal signals (model uncertainty and data complexity) and a balanced difficulty principle; simple and general recipe usable across MLLMs.
2. Multi-benchmark, multi-model evaluation; ablations showing both modules matter; analysis on sub-domains (ScienceQA NAT/SOC/LAN; grade bands). Reported average gains over strong baselines plus reduced seed variance.
3. Method flow (Fig. 2) and pseudocode (Algorithm 1) make the appr

Weaknesses

1. The scorer depends on caption quality and a text-only transformation of multimodal inputs; evidence that this approximates intrinsic multimodal complexity is limited. A calibration study (e.g., correlation with human difficulty ratings or model error rates) would strengthen the claim.
2. The 0.5 uncertainty threshold, equal bucket quotas, and exemplar count are not stress-tested. Without sensitivity curves, it’s hard to disentangle whether “balanced difficulty” per se or just “not all-hard”

Reviewer 04·Rating 6·Confidence 4

Strengths

1. The work's primary novelty lies in creatively fusing two distinct concepts (the model-centric view of uncertainty and the data-centric view of complexity) into a single, unified framework for example selection. Elevating prompt selection to "prompt curriculum design" provides a fresh and systematic perspective for the field.
2. The experimental design is robust. Comprehensive evaluations across five diverse benchmarks and three different mainstream models strongly support the method's general

Weaknesses

1. Limited Scope of the Complexity Assessor and Its Potential Impact on Generalization: The paper's core innovation, the "complexity" dimension, is defined in a way that equates complexity with structural difficulty (i.e., more reasoning steps). This approach is effective for problems where complexity and difficulty are aligned. However, it may fail on tasks where difficulty stems from conceptual insight rather than procedural length. The paper's own results (Figure 4) hint at this, showing a sm
