Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Anupam Pani; Yanchao Yang

arXiv:2603.23202·cs.CV·April 8, 2026

TL;DR

This paper introduces a gaze-regularized training method for vision-language-action models in robotics, aligning model attention with human gaze to improve manipulation performance and interpretability without extra inference costs.

Contribution

The authors propose a gaze-regularization framework that aligns the internal attention of VLA models with human gaze patterns, improving efficiency, robustness, and interpretability.

Findings

01

Gaze regularization yields 4-12% improvements across manipulation benchmarks.

02

Models trained with gaze regularization require fewer training steps.

03

The learned attention patterns are interpretable and mirror human strategies.

Abstract

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The…
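
The regularizer described in the abstract (gaze heatmaps pooled into a patch-level distribution, then compared with the transformer's attention through KL divergence) might look roughly like the NumPy sketch below. This is an illustration under stated assumptions, not the authors' implementation: the function names, the mean over time as the temporal aggregation, the KL direction (gaze relative to attention), and the `weight` coefficient are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def heatmap_to_patch_distribution(heatmaps, patch_size, eps=1e-8):
    """Pool gaze heatmaps of shape (T, H, W) into a distribution over patches.

    Assumption: temporal aggregation is a plain mean over the T frames.
    """
    aggregated = heatmaps.mean(axis=0)
    height, width = aggregated.shape
    rows, cols = height // patch_size, width // patch_size
    # Crop so the map divides evenly into patch_size x patch_size cells.
    aggregated = aggregated[: rows * patch_size, : cols * patch_size]
    # Sum the gaze mass inside each patch.
    patches = aggregated.reshape(rows, patch_size, cols, patch_size).sum(axis=(1, 3))
    flat = patches.reshape(-1) + eps  # eps keeps empty patches strictly positive
    return flat / flat.sum()


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))


def gaze_regularized_loss(task_loss, attention, gaze_patches, weight=0.1):
    """Add a KL penalty to the task loss.

    Assumptions: `attention` is already a normalized distribution over image
    patches, the direction is KL(gaze || attention), and `weight` is a
    hypothetical hyperparameter.
    """
    return task_loss + weight * kl_divergence(gaze_patches, attention)
```

Because the penalty enters only the training loss, a model trained this way would run unchanged at deployment, which is consistent with the abstract's claim of no inference-time overhead.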
