SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Kun Xiang; Terry Jingchen Zhang; Zirong Liu; Bokai Zhou; Yueling Tang; Junjie Yu; Jiacong Lu; Shangrui Huang; Heng Li; Likui Zhang; Kunkun Liu; Changzheng Zhang; Yangle Fang; Boqiang Guo; Hui-Ling Zhen; Dandan Tu; Yinya Huang; Xiaodan Liang

arXiv:2605.09266 · cs.AI · May 13, 2026



TL;DR

SeePhys Pro introduces a benchmark for evaluating how well multimodal models maintain reasoning capabilities when information shifts from text to images, revealing current models' limitations and the effects of blind training.

Contribution

The paper presents a new modality transfer benchmark and analyzes the impact of blind training on multimodal reasoning robustness.

Findings

01

On average, model performance degrades as critical information moves from text to diagrams, with visual variable grounding as the most critical bottleneck.

02

Blind training, that is, RL with all training images masked, can still improve performance even though the model sees no visual evidence during training.

03

Residual cues, not visual evidence, may drive some performance gains.

Abstract

We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked…
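As a rough illustration of the evaluation setup the abstract describes, the sketch below scores four aligned variants of each problem and reports per-variant accuracy together with the change relative to the most text-heavy variant. The variant labels (`v0` to `v3`), the record layout, the function names, and the toy results are all hypothetical; none of them are taken from the paper or its repository.

```python
from collections import defaultdict

# Hypothetical variant labels, ordered from most textual to most visual.
VARIANTS = ["v0", "v1", "v2", "v3"]


def accuracy_by_variant(records):
    """Compute accuracy per variant.

    records: iterable of (problem_id, variant, is_correct) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, variant, is_correct in records:
        total[variant] += 1
        correct[variant] += int(is_correct)
    return {v: correct[v] / total[v] for v in VARIANTS if total[v]}


def change_from_text(acc):
    """Accuracy change of each variant relative to the most textual one."""
    base = acc[VARIANTS[0]]
    return {v: acc[v] - base for v in acc}


# Toy, made-up results for two problems (not data from the paper).
records = [
    ("p1", "v0", True), ("p1", "v1", True), ("p1", "v2", False), ("p1", "v3", False),
    ("p2", "v0", True), ("p2", "v1", False), ("p2", "v2", True), ("p2", "v3", False),
]

acc = accuracy_by_variant(records)
delta = change_from_text(acc)
for v in VARIANTS:
    print(f"{v}: accuracy={acc[v]:.2f} change={delta[v]:+.2f}")
```

Because the variants of a problem are semantically aligned, a negative change for a more visual variant can be read as sensitivity to representation rather than to problem difficulty, under the assumption that the variants differ only in where the information appears.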


Code & Models

Repositories

ai4phys/SeePhy-Pro (GitHub)

