TL;DR
V-tableR1 is a process-supervised reinforcement learning framework that improves how multimodal large language models reason over tables. A critic model gives step-level feedback on the policy model's visual chain-of-thought, and a new optimization algorithm uses that feedback, reaching state-of-the-art accuracy among open-source models.
Contribution
The paper presents V-tableR1, a new RL framework with a critic-guided policy optimization method that improves verifiable reasoning in multimodal models on tabular data.
Findings
V-tableR1 outperforms larger models on tabular benchmarks.
It reduces visual hallucinations and shortcut reasoning.
It achieves state-of-the-art accuracy among open-source models.
Abstract
We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct…
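The contrast between outcome-only and step-level rewards can be illustrated with a toy sketch. This is not the paper's critic or its optimization algorithm: the critic here is a rule-based stand-in that checks each claimed cell read against a small in-memory table, and the names `Step`, `step_rewards`, `combined_reward`, and the `step_weight` parameter are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step: the policy claims it read `claimed` at (row, col)."""
    row: int
    col: int
    claimed: str


def step_rewards(table, steps):
    """Stand-in critic (assumption): 1.0 if the claimed value matches the cell, else 0.0."""
    rewards = []
    for s in steps:
        in_bounds = 0 <= s.row < len(table) and 0 <= s.col < len(table[s.row])
        ok = in_bounds and table[s.row][s.col] == s.claimed
        rewards.append(1.0 if ok else 0.0)
    return rewards


def outcome_reward(answer, gold):
    """Outcome-only signal: checks the final answer and ignores the reasoning."""
    return 1.0 if answer == gold else 0.0


def combined_reward(table, steps, answer, gold, step_weight=0.5):
    """Blend mean step reward with the outcome reward (weighting is illustrative)."""
    r = step_rewards(table, steps)
    mean_step = sum(r) / len(r) if r else 0.0
    return step_weight * mean_step + (1 - step_weight) * outcome_reward(answer, gold)


table = [["city", "pop"], ["Oslo", "700"], ["Bergen", "290"]]

# Grounded trajectory: every cell read matches the table.
grounded = [Step(1, 0, "Oslo"), Step(1, 1, "700")]
# Shortcut trajectory: same final answer, but one step cites the wrong cell.
shortcut = [Step(1, 0, "Oslo"), Step(2, 1, "700")]

print(combined_reward(table, grounded, "700", "700"))
print(combined_reward(table, shortcut, "700", "700"))
```

Both trajectories receive the same outcome reward, so an outcome-only signal cannot tell them apart. The step-level term lowers the score of the trajectory that cites a cell it did not actually read, which is the kind of shortcut that dense feedback is meant to penalize.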
