Intent at a Glance: Gaze-Guided Robotic Manipulation via Foundation Models
Tracey Yee Hsin Tay, Xu Yan, Jonathan Ouyang, Daniel Wu, William Jiang, Jonathan Kao, Yuchen Cui

TL;DR
This paper introduces GAMMA, a gaze-guided robotic manipulation system that uses foundation models to interpret eye gaze and scene context, enabling intuitive and autonomous robot control in manipulation tasks.
Contribution
The work presents a novel system that combines ego-centric gaze tracking with vision-language models to infer user intent and control robots without task-specific training.
Findings
GAMMA achieves robust and intuitive control in tabletop manipulation tasks.
The system generalizes well across different tasks without retraining.
GAMMA outperforms baseline gaze control methods in user studies.
Abstract
Designing intuitive interfaces for robotic control remains a central challenge in enabling effective human-robot interaction, particularly in assistive care settings. Eye gaze offers a fast, non-intrusive, and intent-rich input modality, making it an attractive channel for conveying user goals. In this work, we present GAMMA (Gaze Assisted Manipulation for Modular Autonomy), a system that leverages ego-centric gaze tracking and a vision-language model to infer user intent and autonomously execute robotic manipulation tasks. By contextualizing gaze fixations within the scene, the system maps visual attention to high-level semantic understanding, enabling skill selection and parameterization without task-specific training. We evaluate GAMMA on a range of table-top manipulation tasks and compare it against baseline gaze-based control without reasoning. Results demonstrate that GAMMA…
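To make the abstract's "contextualizing gaze fixations within the scene" step concrete, here is a minimal Python sketch of one way a fixation could be grounded to an object and turned into a text query for a vision-language model. This is an illustration under stated assumptions, not GAMMA's actual implementation: the `DetectedObject` type, the bounding-box format, the `margin` tolerance, the skill names, and the prompt wording are all hypothetical, and the call to the vision-language model itself is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DetectedObject:
    """Hypothetical detection record: a label plus a pixel bounding box."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def fixated_object(
    gaze_xy: Tuple[float, float],
    objects: List[DetectedObject],
    margin: float = 20.0,  # assumed tolerance (pixels) for gaze-tracking error
) -> Optional[DetectedObject]:
    """Return the object whose margin-expanded box contains the gaze point.

    If several boxes contain the point, the one with the nearest center wins.
    Returns None when the gaze point falls on no object.
    """
    gx, gy = gaze_xy
    best, best_dist = None, float("inf")
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        inside = (x0 - margin <= gx <= x1 + margin
                  and y0 - margin <= gy <= y1 + margin)
        if not inside:
            continue
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        dist = ((gx - cx) ** 2 + (gy - cy) ** 2) ** 0.5
        if dist < best_dist:
            best, best_dist = obj, dist
    return best


def build_intent_prompt(
    target: Optional[DetectedObject],
    objects: List[DetectedObject],
    skills: List[str],
) -> str:
    """Compose a text query asking a model to select a skill and its target."""
    scene = ", ".join(o.label for o in objects)
    focus = target.label if target else "nothing in particular"
    return (
        f"Objects in view: {scene}. The user is looking at: {focus}. "
        f"Choose one skill from [{', '.join(skills)}] and name its target object."
    )


# Example scene with two hypothetical detections and a gaze point on the mug.
scene_objects = [
    DetectedObject("mug", (100, 100, 200, 200)),
    DetectedObject("bowl", (300, 100, 400, 200)),
]
target = fixated_object((150, 160), scene_objects)
prompt = build_intent_prompt(target, scene_objects, ["pick", "place", "pour"])
```

In this sketch the geometric step only narrows down what the user is attending to; the choice of skill and its parameters is delegated to the model's answer to the prompt, which mirrors the division of labor the abstract describes at a high level.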
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Social Robot Interaction and HRI · Visual Attention and Saliency Detection
