VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement
Erik Wijmans, Irfan Essa, Dhruv Batra

TL;DR
The paper introduces Variable Experience Rollout (VER), a novel RL scaling technique that combines synchronous and asynchronous methods, leading to significant speed-ups and enabling emergent navigation skills in embodied AI tasks.
Contribution
VER is a new method that efficiently scales on-policy RL across multiple GPUs without synchronization points, improving training speed and enabling emergent navigation behaviors.
Findings
VER achieves 1.6-2x speedup over DD-PPO in navigation tasks.
VER is 2.5-2.7x faster than DD-PPO in mobile manipulation tasks.
Navigation skills emerge unexpectedly in non-navigation training scenarios.
Abstract
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogeneous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). VER learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL). VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state-of-the-art distributed SyncOnRL, with…
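The intuition behind collecting a variable amount of experience per environment can be illustrated with a toy simulation. The sketch below is not the authors' implementation; it assumes each environment has a constant per-step cost, that all environments step in parallel, and it ignores policy inference and learning entirely. The function names and the example step times are hypothetical. It compares the wall-clock time of a fixed-length rollout (every environment must contribute the same number of steps, so the slowest one sets the pace) with a rollout that stops as soon as a fixed total number of steps has been gathered, letting faster environments contribute more.

```python
import heapq


def fixed_length_rollout_time(step_times, steps_per_env):
    """Wall-clock time when every env must produce exactly steps_per_env steps.

    Envs run in parallel, so the slowest env determines the total time.
    """
    return max(t * steps_per_env for t in step_times)


def variable_experience_rollout(step_times, total_steps):
    """Toy event-driven simulation of a variable-length rollout.

    Envs run in parallel; collection stops once total_steps steps have been
    gathered across all envs. Returns (wall_clock_time, steps_per_env).
    """
    # Each heap entry is (time at which the env finishes its next step, env id).
    heap = [(t, i) for i, t in enumerate(step_times)]
    heapq.heapify(heap)
    counts = [0] * len(step_times)
    now = 0.0
    for _ in range(total_steps):
        now, i = heapq.heappop(heap)
        counts[i] += 1
        # The env immediately starts its next step.
        heapq.heappush(heap, (now + step_times[i], i))
    return now, counts


if __name__ == "__main__":
    # Hypothetical setup: three fast envs and one env that is 4x slower.
    step_times = [1.0, 1.0, 1.0, 4.0]
    steps_per_env = 8
    total = steps_per_env * len(step_times)  # same batch size in both cases

    print(fixed_length_rollout_time(step_times, steps_per_env))  # 32.0
    print(variable_experience_rollout(step_times, total))  # (10.0, [10, 10, 10, 2])
```

In this toy setting both schemes gather 32 steps, but the fixed-length rollout waits 32 time units for the slow environment while the variable-length rollout finishes in 10, with the slow environment contributing only 2 steps. The real gap depends on how heterogeneous the environments are.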
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Human Pose and Action Recognition
Methods: Balanced Selection · Decentralized Distributed Proximal Policy Optimization · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
