Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

Ian Chuang; Jinyu Zou; Andrew Lee; Dechen Gao; Iman Soltani

arXiv:2507.15833 · cs.RO · September 23, 2025

Open Access

TL;DR

This paper introduces a human-inspired foveated vision system for robots that improves efficiency and robustness by integrating gaze information into Vision Transformers, reducing computation and enhancing task performance.

Contribution

We develop GIAVA, a robot vision system that emulates human gaze and foveation, and demonstrate its benefits in efficiency, robustness, and high-precision task success.

Findings

01. Foveated vision reduces computational load significantly.

02. Foveated vision improves robustness against background distractors.

03. In some tasks, foveated vision increases success rates.

Abstract

Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in…
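
To make the efficiency claim concrete, the following is a minimal sketch of why foveation can shrink the input to a Vision Transformer. It is not the paper's tokenization scheme: the two-level layout, the function names, and the default sizes (`fovea`, `fine_patch`, `coarse_patch`) are all assumptions chosen for illustration. The sketch tiles the image with coarse cells and subdivides only the cells that overlap a square window centered on the gaze point, then counts the resulting tokens.

```python
def uniform_token_count(height, width, patch):
    """Tokens produced by a standard ViT with non-overlapping square patches."""
    return (height // patch) * (width // patch)


def foveated_token_count(height, width, gaze_xy,
                         fovea=64, fine_patch=16, coarse_patch=64):
    """Token count for a hypothetical two-level foveation layout.

    Assumes height and width are multiples of coarse_patch, and that
    coarse_patch is a multiple of fine_patch. These sizes are illustrative
    assumptions, not values taken from the paper.
    """
    gx, gy = gaze_xy
    half = fovea // 2
    # Fovea window as half-open intervals [x0, x1) and [y0, y1).
    x0, x1 = gx - half, gx + half
    y0, y1 = gy - half, gy + half

    fine_per_cell = (coarse_patch // fine_patch) ** 2
    tokens = 0
    for cy in range(0, height, coarse_patch):
        for cx in range(0, width, coarse_patch):
            overlaps_fovea = (cx < x1 and cx + coarse_patch > x0 and
                              cy < y1 and cy + coarse_patch > y0)
            # Cells near the gaze are split into fine patches; peripheral
            # cells contribute a single coarse token each.
            tokens += fine_per_cell if overlaps_fovea else 1
    return tokens


uniform = uniform_token_count(256, 256, 16)
foveated = foveated_token_count(256, 256, gaze_xy=(128, 128))
print(uniform, foveated)  # 256 76
```

Because self-attention cost grows roughly quadratically with the number of tokens, a reduction of this kind lowers compute by more than the raw token ratio suggests. The actual savings reported in the paper depend on its own design, which this sketch does not reproduce.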

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Face Recognition and Perception