Intent at a Glance: Gaze-Guided Robotic Manipulation via Foundation Models

Tracey Yee Hsin Tay; Xu Yan; Jonathan Ouyang; Daniel Wu; William Jiang; Jonathan Kao; Yuchen Cui

arXiv:2601.05336 · cs.RO · January 12, 2026

TL;DR

This paper introduces GAMMA, a gaze-guided robotic manipulation system that uses foundation models to interpret eye gaze and scene context, enabling intuitive and autonomous robot control in manipulation tasks.

Contribution

The work presents a novel system that combines ego-centric gaze tracking with vision-language models to infer user intent and control robots without task-specific training.

Findings

01. GAMMA achieves robust and intuitive control in tabletop manipulation tasks.

02. The system generalizes well across different tasks without retraining.

03. GAMMA outperforms baseline gaze control methods in user studies.

Abstract

Designing intuitive interfaces for robotic control remains a central challenge in enabling effective human-robot interaction, particularly in assistive care settings. Eye gaze offers a fast, non-intrusive, and intent-rich input modality, making it an attractive channel for conveying user goals. In this work, we present GAMMA (Gaze Assisted Manipulation for Modular Autonomy), a system that leverages ego-centric gaze tracking and a vision-language model to infer user intent and autonomously execute robotic manipulation tasks. By contextualizing gaze fixations within the scene, the system maps visual attention to high-level semantic understanding, enabling skill selection and parameterization without task-specific training. We evaluate GAMMA on a range of table-top manipulation tasks and compare it against baseline gaze-based control without reasoning. Results demonstrate that GAMMA…
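
The idea in the abstract of contextualizing a gaze fixation within the scene and then selecting a skill can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SceneObject`, `SkillCall`, `fixated_object`, `build_prompt`, and `select_skill` are hypothetical, the bounding boxes are assumed to come from some upstream object detector, and the vision-language model is stood in for by a plain text-in, text-out callable (`query_vlm`) so that the example runs without any external service.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SceneObject:
    """A detected object (hypothetical type; boxes assumed given by a detector)."""
    label: str
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class SkillCall:
    """A selected skill and the object it is parameterized with."""
    skill: str
    target: str


def fixated_object(gaze_xy: tuple[float, float],
                   objects: Sequence[SceneObject],
                   margin: float = 10.0) -> Optional[SceneObject]:
    """Return the smallest object whose padded box contains the gaze point."""
    gx, gy = gaze_xy
    hits = [
        o for o in objects
        if o.box[0] - margin <= gx <= o.box[2] + margin
        and o.box[1] - margin <= gy <= o.box[3] + margin
    ]
    if not hits:
        return None
    # Prefer the smallest box so a mug wins over the table it sits on.
    return min(hits, key=lambda o: (o.box[2] - o.box[0]) * (o.box[3] - o.box[1]))


def build_prompt(target: SceneObject,
                 objects: Sequence[SceneObject],
                 skills: Sequence[str]) -> str:
    """Compose a text prompt describing the scene, the fixation, and the options."""
    scene = ", ".join(o.label for o in objects)
    return (
        f"Objects in view: {scene}. "
        f"The user is looking at: {target.label}. "
        f"Reply with exactly one skill from: {', '.join(skills)}."
    )


def select_skill(gaze_xy: tuple[float, float],
                 objects: Sequence[SceneObject],
                 skills: Sequence[str],
                 query_vlm: Callable[[str], str]) -> Optional[SkillCall]:
    """Map a gaze point to a skill call; return None if nothing valid is found."""
    target = fixated_object(gaze_xy, objects)
    if target is None:
        return None
    reply = query_vlm(build_prompt(target, objects, skills)).strip().lower()
    if reply not in skills:
        return None  # Reject replies outside the known skill set.
    return SkillCall(skill=reply, target=target.label)


if __name__ == "__main__":
    scene = [
        SceneObject("table", (0, 0, 640, 480)),
        SceneObject("mug", (300, 200, 360, 260)),
    ]
    # Stub standing in for a real vision-language model call (assumption).
    print(select_skill((320, 230), scene, ["pick", "place"], lambda prompt: "pick"))
```

In this sketch the model's reply is checked against a fixed skill list before anything is returned, which is one simple way to keep a free-form language model from issuing commands the robot does not support. A real system would also pass the image itself to the model rather than only object labels.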

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Social Robot Interaction and HRI · Visual Attention and Saliency Detection