CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation
Juyeong Hwang, Seong-Eun Hong, Jinhyun Kim, JaeYoung Seon, Giljoo Nam, Hanyoung Jang, HyeongYeop Kang

TL;DR
CrowdVLA introduces a novel crowd simulation approach where agents interpret scene semantics and social norms through vision and language, enabling more meaningful and context-aware pedestrian behaviors.
Contribution
The paper presents CrowdVLA, a new framework that models pedestrians as vision-language-action agents capable of consequence-aware decision making in crowd simulation.
Findings
Agents interpret scene semantics and social norms from visual observations.
CrowdVLA enables consequence-aware decision making through simulation rollouts.
The approach shifts crowd simulation from motion-centric to perception-driven behaviors.
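The idea of consequence-aware decision making through rollouts can be illustrated with a toy sketch. This is not the paper's implementation: the grid world, the `ACTIONS` set, the `cost` function, and the greedy continuation policy are all hypothetical stand-ins chosen only to show the general pattern of simulating each candidate action forward and picking the one with the best predicted outcome.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    x: float
    y: float


# Hypothetical discrete action set: name -> (dx, dy) displacement per step.
ACTIONS = {
    "stay": (0.0, 0.0),
    "forward": (1.0, 0.0),
    "left": (0.0, 1.0),
    "right": (0.0, -1.0),
}


def step(state, action):
    """Toy transition model: move by the action's displacement."""
    dx, dy = ACTIONS[action]
    return State(state.x + dx, state.y + dy)


def cost(state, goal, obstacles, radius=0.5):
    """Assumed cost: distance to goal plus a large penalty for a collision."""
    dist = math.hypot(goal.x - state.x, goal.y - state.y)
    collided = any(
        math.hypot(o.x - state.x, o.y - state.y) < radius for o in obstacles
    )
    return dist + (100.0 if collided else 0.0)


def rollout_cost(state, first_action, goal, obstacles, horizon=3):
    """Apply first_action, then follow the locally cheapest action for the
    remaining steps, summing the cost of every visited state."""
    s = step(state, first_action)
    total = cost(s, goal, obstacles)
    for _ in range(horizon - 1):
        s = min(
            (step(s, a) for a in ACTIONS),
            key=lambda nxt: cost(nxt, goal, obstacles),
        )
        total += cost(s, goal, obstacles)
    return total


def choose_action(state, goal, obstacles, horizon=3):
    """Pick the first action whose simulated rollout has the lowest total cost."""
    return min(
        ACTIONS,
        key=lambda a: rollout_cost(state, a, goal, obstacles, horizon),
    )
```

With an obstacle directly ahead, for example `choose_action(State(0, 0), State(3, 0), [State(1, 0)])`, the rollout makes the collision visible before the agent commits, so a sidestep is preferred over walking straight into the obstacle. In this sketch the consequence model is a hand-written cost; the abstract does not specify how CrowdVLA scores outcomes.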
Abstract
Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges: limited agent-centric supervision in crowd datasets, unstable…
