TL;DR
This paper introduces DataPRM, a process-level reward model for data analysis agents that detects silent errors and improves policy learning, outperforming baselines with only 4B parameters.
Contribution
The work presents DataPRM, a novel environment-aware generative reward model that actively verifies intermediate states and distinguishes error types, advancing process reward modeling in dynamic data analysis.
Findings
DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench.
DataPRM achieves 11.28% improvement on DABStep with Best-of-N inference.
DataPRM outperforms strong baselines with only 4B parameters and generalizes across strategies.
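As a rough illustration of the Best-of-N inference mentioned above, the sketch below selects one of N candidate trajectories using per-step scores from a reward model. Everything in it is an assumption for illustration: `best_of_n`, `toy_step_scorer`, and the min-over-steps aggregation are hypothetical names and choices, not DataPRM's actual interface or scoring rule.

```python
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[Sequence[str]],
    step_scorer: Callable[[str], float],
) -> int:
    """Return the index of the candidate trajectory with the highest score.

    Each candidate is a list of step descriptions. A trajectory is scored by
    its weakest step (min aggregation), which is an assumption made here.
    """
    def trajectory_score(steps: Sequence[str]) -> float:
        if not steps:
            return float("-inf")  # an empty trajectory can never win
        return min(step_scorer(step) for step in steps)

    return max(range(len(candidates)), key=lambda i: trajectory_score(candidates[i]))


def toy_step_scorer(step: str) -> float:
    # Hypothetical stand-in for a process reward model: it distrusts steps
    # that silently replace missing values with zero.
    return 0.1 if "fillna(0)" in step else 0.9


candidates = [
    ["load csv", "fillna(0) then mean"],
    ["load csv", "dropna then mean"],
]
chosen = best_of_n(candidates, toy_step_scorer)
```

With the toy scorer, the second trajectory is selected because its weakest step scores higher than the weakest step of the first.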
Abstract
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2)…
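To make the notion of a silent error concrete, here is a minimal sketch in plain Python. The data, function names, and the probe are all hypothetical and are not taken from the paper; they only show how an analysis step can run without raising an exception yet return a wrong result, and how inspecting intermediate state can expose the problem.

```python
# Hypothetical records; None marks a missing measurement.
rows = [
    {"region": "north", "sales": 120.0},
    {"region": "north", "sales": None},
    {"region": "north", "sales": 80.0},
]


def mean_sales_silent(rows):
    # Silent error: missing values are coerced to 0.0, so the code runs
    # without any exception but the mean is biased downward.
    values = [r["sales"] or 0.0 for r in rows]
    return sum(values) / len(values)


def mean_sales_checked(rows):
    # Intended behavior: exclude missing values before averaging.
    values = [r["sales"] for r in rows if r["sales"] is not None]
    return sum(values) / len(values)


def probe_missing(rows):
    # The kind of intermediate-state check a verifier could run:
    # count missing entries before trusting the aggregate.
    return sum(1 for r in rows if r["sales"] is None)


silent = mean_sales_silent(rows)
checked = mean_sales_checked(rows)
missing = probe_missing(rows)
```

Both functions return a number and neither raises, so an exception-based check cannot tell them apart; only the probe reveals that one row was missing and that the two means differ.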
