EQA-RM: A Generative Embodied Reward Model with Test-time Scaling
Yuhang Chen, Zhen Tan, Tianlong Chen

TL;DR
EQA-RM is a generative reward model for Embodied Question Answering (EQA) that returns interpretable feedback, supports test-time scaling, and is sample-efficient, with strong results on the new EQARewardBench benchmark.
Contribution
The paper introduces EQA-RM, a generative multimodal reward model for EQA trained with Contrastive Group Relative Policy Optimization (C-GRPO), and presents EQARewardBench, a benchmark for standardized evaluation of EQA reward models.
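As a rough illustration of the "group relative" part of Contrastive Group Relative Policy Optimization (C-GRPO), the sketch below normalizes each sampled response's reward against the other responses in its group, which is the usual GRPO-style advantage. It is a minimal sketch only: the contrastive component is not described in this summary and is not modeled here, and the function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward relative to its own sampling group.

    `rewards` holds one scalar reward per sampled response in a group.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two responses judged correct (1.0) and two judged incorrect (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's mean receive positive advantages and those below receive negative ones, so the update signal depends on relative rather than absolute reward.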
Findings
EQA-RM achieves 61.9% accuracy with only 700 samples.
EQA-RM outperforms proprietary and open-source baselines.
Test-time scaling enables dynamic evaluation granularity.
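One common way to spend extra inference compute on a generative reward model is to sample several independent evaluations and aggregate their scores. The sketch below shows that idea under loud assumptions: this summary does not say how EQA-RM aggregates its outputs, so the majority-vote rule, the tie-break, and the name `aggregate_scores` are all hypothetical, and the integer scores stand in for values parsed from generated critiques.

```python
from collections import Counter


def aggregate_scores(scores):
    """Majority vote over integer scores sampled from a generative reward model.

    Ties between equally frequent scores go to the one closest to the mean of
    all samples (then to the smaller score). Hypothetical aggregation rule,
    chosen only for illustration.
    """
    if not scores:
        raise ValueError("scores must contain at least one value")
    counts = Counter(scores)
    top = max(counts.values())
    tied = [s for s, c in counts.items() if c == top]
    avg = sum(scores) / len(scores)
    return min(tied, key=lambda s: (abs(s - avg), s))


if __name__ == "__main__":
    # Five sampled evaluations of the same agent response.
    print(aggregate_scores([7, 8, 8, 6, 8]))
```

Sampling more evaluations makes the aggregated score less sensitive to any single generation, at the cost of proportionally more inference calls.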
Abstract
Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents' spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning in Healthcare
