GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, Ming Tang, Minghui Qiu, Jinqiao Wang

TL;DR
GThinker is a new multimodal reasoning model that uses cue-guided rethinking to improve visual grounding and reasoning across diverse scenarios, outperforming existing models on multiple benchmarks.
Contribution
Introduces Cue-Rethinking and a two-stage training pipeline to enhance general multimodal reasoning in large language models.
Findings
Achieves 81.5% on M$^3$CoT benchmark.
Outperforms latest models on multimodal reasoning tasks.
Maintains strong mathematical reasoning performance.
Abstract
Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow thinking strategies, which, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance in tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear problem identification and framing: The paper makes a strong conceptual contribution by identifying a key limitation in current MLLMs - their tendency to persist with initial visual interpretations even when subsequent reasoning or new contextual information exposes inconsistencies. Framing this as a visual rethinking problem, and proposing to address it through an adaptive cue-rethinking mechanism, is both novel and well-motivated.
2. Another notable strength is the pattern-guided co
Overall concern: While the proposed approach is intuitively sound, many of the claims appear overstated, and the empirical gains are either marginal or dataset-specific. The improvements on M3CoT may largely stem from in-distribution bias (the training dataset details in appendix section A.1), making it difficult to conclude that the method has broad impact or generalizable benefits. If the approach is primarily designed for such settings, this should be explicitly stated and motivated.
1. Unsu
1. GThinker demonstrates state-of-the-art performance on a wide variety of benchmarks, impressively generalizing across mathematics, science, and general-domain tasks.
2. The "Judge-Guided Selective Training" is a novel training strategy. Using a judge to train selectively on failure cases (especially visual-based failures) is interesting.
3. The proposed GThinker-11k dataset is a valuable contribution to the community.
4. The ablation on the iterative data refinement is nice and valuable.
1. **Need for Clearer Substantiation of the Core Motivational Claims:** The paper's motivation hinges on two key assertions: (a) that general multimodal reasoning is more "reliant on visual interpretation" than math/science tasks, and (b) that MLLMs "uncritically accept initial visual interpretations" while being capable of correcting textual context. While these claims are intuitively plausible, the introduction presents them as foundational axioms without direct preliminary experiments or cita
1. This paper excels in its conceptual introduction and methodological exposition. It clearly identifies a core asymmetry in multimodal reasoning and visually illustrates (e.g., in Figures 1, 2, and 3) the Cue-Rethinking pattern, its three-stage process, and the training pipeline in detail. This thorough and comprehensible presentation allows readers to quickly grasp GThinker's central innovative ideas and encourages further research.
2. GThinker's training framework demonstrates good reproduci
1. The generalizability of the training data construction raises concerns. The study relies on samples extracted from and corrected based on the error patterns of a specific model (e.g., Qwen-VL) to guide the learning of "adaptive visual rethinking" behavior. This approach may lead to training data that is overly tailored to the specific deficiencies of the source model, thereby limiting the effective transferability of the GThinker strategy to other multimodal models (e.g., LLaVA) which possess
1. GThinker identifies a gap in MLLMs: uncritical acceptance of initial visual cues despite textual reflection capabilities. This provides a fresh perspective for multimodal reasoning research.
2. The Cue-Rethinking pattern innovatively combines free-form reasoning with visual cue grounding (<vcues>), enabling flexible rethinking without rigid templates.
3. The iterative annotation pipeline generates high-quality data (GThinker-11k), mitigating hallucination through flawed sampling-contrast-generation
1. The claim that existing models “ignore key visual cues” lacks quantitative support. Empirical analysis (e.g., error-case statistics) is needed to validate this asymmetry.
2. Experiments are restricted to Qwen2.5-VL-7B. Validation on larger/alternative architectures can be explored to ensure generalizability.
3. Typo: Table 1 shows that InternVL2.5-MPO-8B is the best model on Commonsense Soc. (82.6).
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Speech and dialogue systems
