CodePercept: Code-Grounded Visual STEM Perception for MLLMs
Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

TL;DR
CodePercept introduces a novel approach to enhance STEM visual perception in multimodal large language models by leveraging executable code as a perceptual medium, supported by a large dataset and a new evaluation benchmark.
Contribution
The paper shows, through a systematic scaling analysis, that perception is the key bottleneck in STEM visual reasoning, and it proposes code-grounded perception methods together with a new dataset (ICC-1M) and a new evaluation benchmark.
Findings
Perception scaling outperforms reasoning scaling in STEM visual tasks.
Code-grounded captioning reduces hallucinations in image descriptions.
The new benchmark enables deterministic evaluation of visual perception in STEM.
Abstract
When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two…
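To make the Image-Caption-Code idea concrete, here is a minimal sketch of what one triplet could look like and why code gives exact semantics. This summary does not describe the actual ICC-1M schema, so the `Triplet` class, its field names, the `scene` variable, and the `scene_from_code` helper are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    """Hypothetical Image-Caption-Code record (not the real ICC-1M schema)."""
    image_path: str  # rendered STEM figure
    caption: str     # natural-language description of the figure
    code: str        # executable code that defines the same figure


def scene_from_code(code: str) -> dict:
    """Run the code and return the structured scene it defines.

    Assumption: the code assigns a dict to a variable named `scene`.
    Builtins are disabled so only plain literals can be evaluated.
    """
    namespace: dict = {}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["scene"]


example = Triplet(
    image_path="triangle.png",
    caption="A right triangle with legs of length 3 and 4.",
    code='scene = {"shape": "triangle", "vertices": [(0, 0), (3, 0), (0, 4)]}',
)

# Executing the code recovers exact geometry that a caption can only paraphrase.
scene = scene_from_code(example.code)
```

Because the code is executable, a description of the figure can be checked against the recovered `scene` values rather than judged by free-form text similarity, which is one plausible reading of how code could support deterministic evaluation.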
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques
