EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Yu Bai; MingMing Yu; Chaojie Li; Ziyi Bai; Xinlong Wang; Börje F. Karlsson

arXiv:2602.04515·cs.RO·February 5, 2026

TL;DR

EgoActor introduces a vision-language model that grounds high-level instructions into precise, spatially aware humanoid actions, enabling real-time, robust task execution in complex environments.

Contribution

The paper presents EgoActor, a scalable VLM that integrates perception and action grounding for humanoid robots, addressing real-world challenges with egocentric data and spatial reasoning.

Findings

01. EgoActor performs real-time action inference in under 1 second.

02. The model generalizes across diverse tasks and unseen environments.

03. EgoActor effectively bridges task planning and motor execution.

Abstract

Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into varied, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world…
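To make the action categories named in the abstract concrete, here is a minimal Python sketch of how such a unified action space might be represented. It is illustrative only and is not taken from the paper: every class name, field, and unit below (`LocomotionAction`, `amount`, `yaw_deg`, and so on) is a hypothetical assumption, and the actual output format of EgoActor may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Locomotion(Enum):
    """Locomotion primitives listed in the abstract."""
    WALK = "walk"
    TURN = "turn"
    MOVE_SIDEWAYS = "move sideways"
    CHANGE_HEIGHT = "change height"


@dataclass(frozen=True)
class LocomotionAction:
    primitive: Locomotion
    amount: float  # hypothetical magnitude; units are an assumption


@dataclass(frozen=True)
class HeadAction:
    yaw_deg: float    # hypothetical field
    pitch_deg: float  # hypothetical field


@dataclass(frozen=True)
class ManipulationAction:
    command: str  # hypothetical free-text command


@dataclass(frozen=True)
class InteractionAction:
    utterance: str  # hypothetical human-robot interaction payload


# One type covering the four action categories named in the abstract.
Action = Union[LocomotionAction, HeadAction, ManipulationAction, InteractionAction]


def describe(action: Action) -> str:
    """Render an action as a short, human-readable string."""
    if isinstance(action, LocomotionAction):
        return f"locomotion: {action.primitive.value} {action.amount}"
    if isinstance(action, HeadAction):
        return f"head: yaw={action.yaw_deg} pitch={action.pitch_deg}"
    if isinstance(action, ManipulationAction):
        return f"manipulation: {action.command}"
    if isinstance(action, InteractionAction):
        return f"interaction: {action.utterance}"
    raise TypeError(f"unsupported action type: {type(action).__name__}")


# Example plan mixing action categories (contents are made up for illustration).
plan = [
    LocomotionAction(Locomotion.TURN, 30.0),
    HeadAction(yaw_deg=0.0, pitch_deg=-15.0),
    ManipulationAction("pick up the cup"),
]
for step in plan:
    print(describe(step))
```

Putting all four categories under one `Action` type mirrors the abstract's description of a single model that predicts locomotion, head, manipulation, and interaction outputs together, but the concrete schema here is only one of many plausible designs.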

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Robot Manipulation and Learning