CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Shuhang Chen; Yunqiu Xu; Junjie Xie; Aojun Lu; Tao Feng; Zeying Huang; Ning Zhang; Yi Sun; Yi Yang; Hangjie Yuan

arXiv:2601.01874·cs.CV·February 25, 2026

Open Access·3 Reviews

TL;DR

CogFlow introduces a three-stage, knowledge-internalization framework for visual mathematical problem solving, enhancing perception, integration, and reasoning to improve model accuracy and faithfulness in visual reasoning tasks.

Contribution

The paper proposes a novel hierarchical framework with a knowledge internalization stage, new reward models, and a visual-gated policy optimization, advancing visual mathematical reasoning.

Findings

01. Outperforms existing models on visual mathematical reasoning benchmarks.

02. Enhances perception and reasoning fidelity through new reward mechanisms.

03. Provides a large, annotated dataset for training and evaluation.

Abstract

Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception $\Rightarrow$ internalization $\Rightarrow$ reasoning. In line with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

+ The paper proposes a new three-step framework for faithful visual mathematical reasoning.
+ It constructs a new dataset, MATHCOG, comprising three subsets, one for each stage.
+ Experiments on the FlowVerse and MathVerse datasets show substantial improvement over baseline methods.

Weaknesses

+ Some important details are missing. For example, how are the contrastive pairs in the MATHCOG-VAR dataset constructed? How are the five common types of reasoning error incorporated? How is the visual anchor reward $R_{VAR}$ implemented? Is it a reward in the VRPO algorithm, or is it a training stage between SFT and VRPO? How is the correctness and format reward $R_{IR}$ implemented? How is the visual parameterized reward computed in the parameter space if the predicted primitive and the ground-truth primiti

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper addresses an important problem: the lack of perceptual understanding and the reasoning drift in VLMs.
2. The proposed method is quite comprehensive: each component ensures that the model learns to perceive images accurately at the primitive level, and the model is encouraged to utilize perception via additional rewards.
3. The experimental results suggest that the method works very well on the FlowVerse, MathVerse, and MathVista datasets. The ablation studies showcase the usefu

Weaknesses

1. Since the paper has too many moving parts, it could be written better. Despite decent experience in VL reasoning, I found it hard to keep up with a dense introduction full of jargon. The authors should think about how to tell their story in a simpler manner.
2. The proposed method to get the primitives for shapes (e.g., Circle) does not seem scalable. How will you get primitives for shapes where the actual dimensions are absent like many natural scenes or even synt

Reviewer 03·Rating 6·Confidence 3

Strengths

1. The paper presents a conceptually clear and cognitively inspired framework that divides multimodal reasoning into perception, internalization, and reasoning stages, effectively bridging visual understanding and symbolic inference.
2. The paper demonstrates a well-organized methodological design, where the layered reward mechanisms are technically coherent and systematically integrated across different reasoning stages.
3. The paper shows consistent improvements on three visual mathematical b

Weaknesses

1. The paper mainly reports results on MathVista, where the gains are substantial, but lacks systematic comparisons with other recent benchmarks such as MathVision, We-Math, and DynaMath. This limits the generality of the claimed performance advantage.
2. The paper evaluates only a single model scale (Qwen2.5-VL-7B) without testing different sizes or architectures, making it difficult to assess the scalability and broader applicability of the proposed framework.
3. The paper provides limited a

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Advanced Graph Neural Networks