PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi, Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs

TL;DR
PoliFormer is a scalable, transformer-based indoor navigation system trained entirely in simulation that generalizes well to real-world robots, achieving state-of-the-art success rates and versatile downstream capabilities.
Contribution
The paper introduces PoliFormer, a novel transformer-based RL navigation agent trained at scale, demonstrating superior real-world generalization and multi-task adaptability without finetuning.
Findings
Achieves 85.5% success in object goal navigation on CHORES-S.
Outperforms previous methods by 28.5% in success rate.
Extensible to multiple downstream navigation tasks.
Abstract
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Games · Multi-Agent Systems and Negotiation
MethodsAttention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Vision Transformer
