Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation
Anupam Pani, Yanchao Yang

TL;DR
This paper introduces a gaze-regularized training method for vision-language-action models in robotics, aligning model attention with human gaze to improve manipulation performance and interpretability without extra inference costs.
Contribution
The authors propose a novel gaze-regularization framework that enhances VLA models by aligning their attention with human gaze patterns, improving efficiency, robustness, and interpretability.
Findings
Achieved 4-12% performance improvements on manipulation benchmarks.
Models trained with gaze regularization require fewer training steps.
The learned attention patterns are interpretable and mirror human strategies.
Abstract
Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The…
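The regularizer described in the abstract (gaze heatmap to patch-level distribution, then a KL term on attention) might be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The function names, the sum-pooling over patches, the KL direction (gaze as target), and the `weight` value are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def heatmap_to_patch_distribution(heatmap, patch_size, eps=1e-8):
    """Pool an (H, W) gaze heatmap into a flat distribution over image patches.

    Hypothetical helper: sum-pooling per patch is an assumed choice.
    """
    h, w = heatmap.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop so the map tiles evenly, then sum the gaze mass inside each patch.
    cropped = heatmap[: gh * patch_size, : gw * patch_size]
    pooled = cropped.reshape(gh, patch_size, gw, patch_size).sum(axis=(1, 3))
    # Small epsilon keeps every patch strictly positive before normalizing.
    flat = pooled.reshape(-1) + eps
    return flat / flat.sum()


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def gaze_regularized_loss(task_loss, attention, gaze_dist, weight=0.1):
    """Task loss plus a weighted KL term pulling attention toward gaze.

    Assumption: gaze is the target distribution, i.e. KL(gaze || attention).
    `weight` is an illustrative value, not one reported in the paper.
    """
    return task_loss + weight * kl_divergence(gaze_dist, attention)


# Usage: aggregate per-frame heatmaps over time (summing is an assumption),
# then compare against a uniform attention map over the same 4 patches.
frames = np.zeros((3, 4, 4))
frames[:, 0, 0] = 1.0  # gaze fixed on the top-left corner in every frame
gaze = heatmap_to_patch_distribution(frames.sum(axis=0), patch_size=2)
uniform_attention = np.full(4, 0.25)
loss = gaze_regularized_loss(1.0, uniform_attention, gaze)
```

Because the KL term only enters the training loss, a sketch like this adds nothing to the forward pass at deployment, which is consistent with the abstract's claim of no inference-time overhead.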
