Visually-grounded Humanoid Agents

Hang Ye, Xiaoxuan Ma, Fan Lu, Wayne Wu, Kwan-Yee Lin, and Yizhou Wang

arXiv:2604.08509·cs.CV·April 10, 2026

TL;DR

This paper introduces visually-grounded humanoid agents capable of autonomous, goal-directed behaviors in 3D environments using visual observations, semantic scene reconstruction, and embodied planning.

Contribution

The paper presents a coupled two-layer (world-agent) paradigm: a World Layer that reconstructs semantically rich 3D scenes from real-world videos, and autonomous humanoid agents with perception and planning capabilities.

Findings

01. Agents achieve higher task success rates than baselines.

02. Agents exhibit fewer collisions in diverse environments.

03. The approach enables scalable, active digital human populations.

Abstract

Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environment at scale with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates…
