CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan

TL;DR
CogFlow introduces a three-stage, knowledge-internalization framework for visual mathematical problem solving, enhancing perception, integration, and reasoning to improve model accuracy and faithfulness in visual reasoning tasks.
Contribution
The paper proposes a novel hierarchical framework with a knowledge internalization stage, new reward models, and a visual-gated policy optimization, advancing visual mathematical reasoning.
Findings
Outperforms existing models on visual mathematical reasoning benchmarks.
Enhances perception and reasoning fidelity through new reward mechanisms.
Provides a large, annotated dataset for training and evaluation.
Abstract
Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception ⇒ internalization ⇒ reasoning. In line with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in…
Peer Reviews
Decision: ICLR 2026 Poster
+ The paper proposes a new three-step framework for faithful visual mathematical reasoning.
+ It constructs a new dataset, MATHCOG, comprising three subsets, one for each stage.
+ Experiments on the FlowVerse and MathVerse datasets show substantial improvements over baseline methods.
+ Some important details are missing. For example, how are the contrastive pairs in the MATHCOG-VAR dataset constructed? How are the five common types of reasoning error incorporated? How is the visual anchor reward $R_{VAR}$ implemented? Is it a reward in the VRPO algorithm, or is it a training stage between SFT and VRPO? How is the correctness and format reward $R_{IR}$ implemented? How to compute the visual parameterized reward in the parameter space if the predicted primitive and the ground-truth primiti
1. The paper addresses an important problem: the lack of perceptual understanding and the reasoning drift in VLMs.
2. The proposed method is quite comprehensive: each component ensures that the model learns to perceive images accurately at the primitive level, and the model is encouraged to utilize perception via additional rewards.
3. The experimental results suggest that the method works very well on the FlowVerse, MathVerse, and MathVista datasets. The ablation studies showcase the usefu
1. Since the paper has too many moving parts, it could be written better. Despite decent experience in VL reasoning, I found it hard to keep up with a dense introduction full of jargon. The authors should think about how to present their story more simply.
2. The proposed method to get the primitives for shapes (e.g., Circle) does not seem scalable. How will you get primitives for shapes where the actual dimensions are absent like many natural scenes or even synt
1. The paper presents a conceptually clear and cognitively inspired framework that divides multimodal reasoning into perception, internalization, and reasoning stages, effectively bridging visual understanding and symbolic inference.
2. The paper demonstrates a well-organized methodological design, where the layered reward mechanisms are technically coherent and systematically integrated across different reasoning stages.
3. The paper shows consistent improvements on three visual mathematical b
1. The paper mainly reports results on MathVista, where the gains are substantial, but lacks systematic comparisons with other recent benchmarks such as MathVision, We-Math, and DynaMath. This limits the generality of the claimed performance advantage.
2. The paper evaluates only a single model scale (Qwen2.5-VL-7B) without testing different sizes or architectures, making it difficult to assess the scalability and broader applicability of the proposed framework.
3. The paper provides limited a
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Advanced Graph Neural Networks
