Visual-ERM: Reward Modeling for Visual Equivalence

Ziyu Liu; Shengyuan Ding; Xinyu Fang; Xuanlang Dai; Penghui Yang; Jianze Liang; Jiaqi Wang; Kai Chen; Dahua Lin; Yuhang Zang

arXiv:2603.13224·cs.CV·May 12, 2026

TL;DR

This paper introduces Visual-ERM, a multimodal generative reward model that gives fine-grained visual feedback for vision-to-code tasks and improves reinforcement learning performance. It also introduces VC-RewardBench, a benchmark for visual equivalence evaluation.

Contribution

The paper presents Visual-ERM, a reward model that improves vision-to-code RL by capturing fine-grained visual discrepancies, and introduces VC-RewardBench, a benchmark for comparing structured visual data.

Findings

01. Used as the RL reward, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and also improves other visual parsing tasks.

02. Visual-ERM outperforms existing reward methods at detecting fine-grained visual discrepancies.

03. VC-RewardBench evaluates visual equivalence on structured visual data.

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent…
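The abstract's point that coarse visual similarity can miss fine-grained discrepancies can be illustrated with a toy sketch. This is not Visual-ERM, which is a generative multimodal reward model. It is a hypothetical stand-in, assuming a chart can be rasterized to a small binary grid, that an average-pooled grid plays the role of a "coarse embedding", and that a per-column pixel comparison plays the role of fine-grained feedback. All function names here are invented for illustration.

```python
import math

def render_bars(heights, canvas_height=8, bar_width=2):
    """Rasterize bar heights into a binary grid (1 = ink); row 0 is the top."""
    grid = []
    for row in range(canvas_height):
        level = canvas_height - row  # bar height that reaches this row
        line = []
        for h in heights:
            line.extend([1 if h >= level else 0] * bar_width)
        grid.append(line)
    return grid

def coarse_embedding(grid, blocks=2):
    """Average-pool the grid into blocks x blocks cells (toy global embedding)."""
    rows, cols = len(grid), len(grid[0])
    rh, cw = rows // blocks, cols // blocks
    emb = []
    for br in range(blocks):
        for bc in range(blocks):
            cells = [grid[r][c]
                     for r in range(br * rh, (br + 1) * rh)
                     for c in range(bc * cw, (bc + 1) * cw)]
            emb.append(sum(cells) / len(cells))
    return emb

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mismatched_columns(ref, cand):
    """Return the pixel columns where the two renderings differ anywhere."""
    return [c for c in range(len(ref[0]))
            if any(ref[r][c] != cand[r][c] for r in range(len(ref)))]

# Reference chart vs. a candidate whose second bar is one unit too short.
reference = render_bars([3, 5, 2, 6])
candidate = render_bars([3, 4, 2, 6])

coarse_score = cosine(coarse_embedding(reference), coarse_embedding(candidate))
bad_cols = mismatched_columns(reference, candidate)
fine_score = 1 - len(bad_cols) / len(reference[0])

print(f"coarse similarity: {coarse_score:.3f}")   # stays above 0.99
print(f"mismatched columns: {bad_cols}")          # localizes the wrong bar
print(f"fine-grained score: {fine_score:.2f}")
```

In this toy setup the pooled similarity barely moves even though one bar is wrong, while the column-level comparison both lowers the score and points to where the error is. A reward that reports where the rendering differs is the kind of signal the abstract calls fine-grained and interpretable.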

Code & Models

Repositories

internlm/Visual-ERM (GitHub)

Models

internlm/Visual-ERM (Hugging Face model, 23 downloads, 10 likes)

Datasets

internlm/VC-RewardBench (dataset, 295 downloads)
