Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, Iman Soltani

TL;DR
This paper introduces a human-inspired foveated vision system for robots that integrates human gaze information into Vision Transformers, reducing computation while improving robustness and, in some tasks, success rates.
Contribution
We develop GIAVA, a robot vision system that emulates human gaze and foveation, and demonstrate its benefits in efficiency, robustness, and high-precision task success.
Findings
Foveated vision reduces computational load significantly.
Foveated vision improves robustness against background distractors.
In some tasks, foveated vision increases success rates.
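To illustrate why foveation can cut computation, here is a minimal sketch of a two-level foveated tokenizer in NumPy. It is not the paper's method: the function names (`patchify`, `foveated_tokens`), the two-level scheme (one full-resolution crop around the gaze point plus a naively subsampled view of the whole frame), and all sizes are assumptions chosen for illustration. The point is only that a transformer sees far fewer patch tokens than with uniform patching.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into non-overlapping flattened patch tokens."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two grid axes to the front, then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def foveated_tokens(img, gaze_xy, patch=16, fovea=64, stride=4):
    """Hypothetical two-level foveation: sharp crop at gaze + coarse periphery."""
    h, w, _ = img.shape
    gx, gy = gaze_xy
    # Clamp the crop so it stays inside the image even for gaze near a border.
    x0 = int(np.clip(gx - fovea // 2, 0, w - fovea))
    y0 = int(np.clip(gy - fovea // 2, 0, h - fovea))
    fovea_crop = img[y0:y0 + fovea, x0:x0 + fovea]
    # Naive subsampling stands in for a proper low-pass downsample.
    periphery = img[::stride, ::stride]
    return np.concatenate([patchify(fovea_crop, patch),
                           patchify(periphery, patch)])

img = np.zeros((256, 256, 3), dtype=np.float32)
uniform = patchify(img, 16)                    # 16 x 16 grid of patches
foveated = foveated_tokens(img, gaze_xy=(250, 10))
print(uniform.shape[0], foveated.shape[0])     # token counts for each scheme
```

With these assumed sizes, uniform patching yields 256 tokens and the foveated variant yields 32 (16 from the crop, 16 from the periphery). Since self-attention cost grows roughly quadratically with token count, an 8x reduction in tokens corresponds to roughly a 64x reduction in attention cost for this toy configuration; actual savings depend on the architecture and foveation pattern used.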
Abstract
Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in…
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Face Recognition and Perception
