Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian

TL;DR
This paper introduces EyeVLA, a unified vision-language-action model that lets a pan-tilt-zoom (PTZ) camera actively acquire informative views from natural language instructions, trained with a data-efficient pipeline.
Contribution
The paper presents a novel hierarchical action encoding and a reinforcement learning pipeline that transfers open-world understanding to embodied perception with minimal real-world data.
Findings
Achieves a 96% task completion rate across diverse scenes.
Effectively integrates perception, language understanding, and physical control.
Demonstrates data-efficient training with only 500 real-world samples.
Abstract
In embodied AI, visual perception should be active rather than passive: the system must decide where to look and at what scale to sense to acquire maximally informative data under pixel and spatial budget constraints. Existing vision models coupled with fixed RGB-D cameras fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. We study the task of language-guided active visual perception: given a single RGB image and a natural language instruction, the agent must output pan, tilt, and zoom adjustments of a real PTZ (pan-tilt-zoom) camera to acquire the most informative view for the specified task. We propose EyeVLA, a unified framework that addresses this task by integrating visual perception, language understanding, and physical camera control within a single autoregressive…
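The summary above mentions a hierarchical action encoding for pan, tilt, and zoom commands but does not spell it out. One generic way to turn a continuous camera adjustment into discrete tokens that an autoregressive model could emit is coarse-to-fine quantization, sketched below. This is an illustration under stated assumptions, not the paper's encoding: the bin counts, the pan range, and the names `AxisRange`, `encode`, and `decode` are all hypothetical.

```python
from dataclasses import dataclass

# Illustrative assumptions, not values from the paper.
COARSE_BINS = 16  # number of coarse tokens per axis
FINE_BINS = 16    # number of fine tokens inside each coarse bin


@dataclass(frozen=True)
class AxisRange:
    """Allowed interval for one camera axis (pan, tilt, or zoom)."""
    low: float
    high: float


def encode(value: float, rng: AxisRange) -> tuple[int, int]:
    """Map a continuous value to a (coarse, fine) token pair."""
    total = COARSE_BINS * FINE_BINS
    clamped = min(max(value, rng.low), rng.high)
    frac = (clamped - rng.low) / (rng.high - rng.low)
    index = min(int(frac * total), total - 1)  # keep the upper bound in range
    return divmod(index, FINE_BINS)


def decode(coarse: int, fine: int, rng: AxisRange) -> float:
    """Return the centre of the bin addressed by (coarse, fine)."""
    total = COARSE_BINS * FINE_BINS
    index = coarse * FINE_BINS + fine
    return rng.low + (index + 0.5) / total * (rng.high - rng.low)


# Hypothetical pan range in degrees.
PAN = AxisRange(-30.0, 30.0)
coarse, fine = encode(12.3, PAN)
recovered = decode(coarse, fine, PAN)
```

With this layout the coarse token selects a broad region of the axis and the fine token refines it, so the round-trip error is at most half of one fine bin width. Out-of-range requests are clamped to the nearest limit rather than rejected.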
