EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang, Börje F. Karlsson

TL;DR
EgoActor introduces a vision-language model that grounds high-level instructions into precise, spatially aware humanoid actions, enabling real-time, robust task execution in complex environments.
Contribution
The paper presents EgoActor, a scalable VLM that integrates perception and action grounding for humanoid robots, addressing real-world challenges with egocentric data and spatial reasoning.
Findings
EgoActor performs real-time action inference in under 1 second.
The model generalizes across diverse tasks and unseen environments.
EgoActor effectively bridges task planning and motor execution.
Abstract
Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and in dynamically changing environments, as well as robust transitions between sub-tasks of different types. Towards addressing these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into varied, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world…
Taxonomy
Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Robot Manipulation and Learning
