Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness
Reihaneh Mirjalili, Tobias Jülg, Florian Walter, Wolfram Burgard

TL;DR
ARRO introduces a zero-shot visual masking technique using open-vocabulary models to improve the robustness of visuomotor policies in robotic manipulation tasks across diverse environments.
Contribution
The paper presents ARRO, a novel visual representation that makes robot policies more robust by masking task-irrelevant scene regions in real time, without additional training.
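To make the masking idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes the binary masks for the task-relevant regions (for example the gripper and the target object) have already been produced by some segmentation model, which is not invoked here. The function name `mask_irrelevant_regions`, the `fill_value` parameter, and the choice of a black fill are all hypothetical and chosen for illustration only.

```python
import numpy as np


def mask_irrelevant_regions(image, relevant_masks, fill_value=0):
    """Keep only pixels covered by at least one task-relevant mask.

    image:          array of shape (H, W, C), e.g. an RGB camera frame.
    relevant_masks: iterable of boolean arrays of shape (H, W); assumed to
                    come from an upstream segmentation model (not shown).
    fill_value:     value written to every pixel outside all masks
                    (black by default; an assumption for this sketch).
    """
    if image.ndim != 3:
        raise ValueError("image must have shape (H, W, C)")

    # Union of all task-relevant masks.
    keep = np.zeros(image.shape[:2], dtype=bool)
    for mask in relevant_masks:
        mask = np.asarray(mask, dtype=bool)
        if mask.shape != keep.shape:
            raise ValueError("each mask must have shape (H, W)")
        keep |= mask

    # Start from a uniformly filled frame and copy over the kept pixels.
    out = np.full_like(image, fill_value)
    out[keep] = image[keep]
    return out


# Toy example: a 4x4 RGB frame with two hypothetical relevant regions.
frame = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)

gripper = np.zeros((4, 4), dtype=bool)
gripper[0, :2] = True          # top-left two pixels

target = np.zeros((4, 4), dtype=bool)
target[2:, 2:] = True          # bottom-right 2x2 block

masked = mask_irrelevant_regions(frame, [gripper, target])
```

In this sketch the same function would be applied to every frame during both training and inference, so the policy only ever sees the masked view.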
Findings
ARRO improves robustness to scene variations.
ARRO enables selective masking of objects.
ARRO shows consistent performance gains across tasks.
Abstract
Visuomotor policies trained on human expert demonstrations have recently shown strong performance across a wide range of robotic manipulation tasks. However, these policies remain highly sensitive to domain shifts stemming from background or robot embodiment changes, which limits their generalization capabilities. In this paper, we present ARRO, a novel visual representation that leverages zero-shot open-vocabulary segmentation and object detection models to efficiently mask out task-irrelevant regions of the scene in real time without requiring additional training, modeling of the setup, or camera calibration. By filtering visual distractors and overlaying virtual guides during both training and inference, ARRO improves robustness to scene variations and reduces the need for additional data collection. We extensively evaluate ARRO with Diffusion Policy on a range of tabletop…
