ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen

TL;DR
ARM introduces a reward modeling framework that estimates relative advantage using a tri-state labeling strategy, improving long-horizon manipulation with minimal human effort and enhanced data efficiency.
Contribution
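The paper presents Advantage Reward Modeling (ARM), which replaces hard-to-quantify absolute progress estimation with relative advantage estimation, supervised by a tri-state labeling strategy (Progressive, Regressive, Stagnant), to give RL richer intermediate supervision on long-horizon tasks.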
Findings
Achieved a 99.4% success rate on a towel-folding task.
Enabled stable, data-efficient policy training with minimal human intervention.
Outperformed current vision-language-action (VLA) baselines on long-horizon manipulation.
Abstract
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline…
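As a rough illustration of why tri-state labels suit non-monotonic behavior, the sketch below accumulates per-step labels into a progress-like signal. It is not the paper's method: the abstract does not say how labels are turned into progress values, so the numeric mapping (+1, 0, -1) and the names `Label` and `cumulative_progress` are assumptions made purely for this example.

```python
from enum import IntEnum
from typing import List, Sequence


class Label(IntEnum):
    """Tri-state labels named in the abstract; the numeric values are assumed."""
    REGRESSIVE = -1
    STAGNANT = 0
    PROGRESSIVE = 1


def cumulative_progress(labels: Sequence[Label]) -> List[int]:
    """Running sum of per-step labels (hypothetical helper).

    The result can decrease, so backtracking and recovery are
    representable, unlike a strictly monotonic progress score.
    """
    total = 0
    progress = []
    for label in labels:
        total += int(label)
        progress.append(total)
    return progress


# A short segment with one backtracking step and one stalled step.
segment = [
    Label.PROGRESSIVE,
    Label.PROGRESSIVE,
    Label.REGRESSIVE,
    Label.STAGNANT,
    Label.PROGRESSIVE,
]
print(cumulative_progress(segment))  # [1, 2, 1, 1, 2]
```

Because each label is relative to the preceding state, the same accumulation applies to a fragment that starts mid-task, which is one way to read the abstract's claim about fragmented DAgger-style data.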
