Visual-ERM: Reward Modeling for Visual Equivalence
Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

TL;DR
This paper introduces Visual-ERM, a multimodal reward model that provides fine-grained visual feedback for vision-to-code tasks, improving reinforcement learning performance and establishing a new benchmark for visual equivalence evaluation.
Contribution
The paper presents Visual-ERM, a reward model that improves vision-to-code RL by capturing detailed visual discrepancies. It also introduces VC-RewardBench, a benchmark for comparing structured visual data.
Findings
Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and also improves performance on other visual parsing tasks.
Visual-ERM outperforms existing reward methods in fine-grained visual discrepancy detection.
The benchmark VC-RewardBench effectively evaluates visual equivalence in structured visual data.
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards rely on either textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent…
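To make the idea of fine-grained, interpretable feedback concrete, here is a minimal sketch of how a list of discrepancies reported by a generative reward model might be collapsed into a scalar RL reward. Everything in it is an assumption for illustration: the `Discrepancy` type, the severity labels, the penalty weights, and the `reward_from_discrepancies` function are hypothetical and are not taken from the paper, which may score outputs quite differently.

```python
from dataclasses import dataclass

# Hypothetical severity weights, chosen only for illustration (not from the paper).
SEVERITY_PENALTY = {"minor": 0.1, "major": 0.3, "critical": 0.6}


@dataclass
class Discrepancy:
    """One visual difference between the reference image and the rendered prediction."""
    description: str
    severity: str  # must be a key of SEVERITY_PENALTY


def reward_from_discrepancies(discrepancies):
    """Map reported discrepancies to a scalar reward clamped to [0, 1]."""
    penalty = sum(SEVERITY_PENALTY[d.severity] for d in discrepancies)
    return max(0.0, 1.0 - penalty)


# Example: feedback a reward model might emit for a rendered chart (made-up data).
example = [
    Discrepancy("legend is missing", "major"),
    Discrepancy("bar color differs from reference", "minor"),
]
print(reward_from_discrepancies(example))
```

Unlike a single embedding-similarity score, a reward built this way stays traceable to individual, human-readable differences, which is the property the abstract describes as interpretable feedback.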
