CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation

Juyeong Hwang; Seong-Eun Hong; Jinhyun Kim; JaeYoung Seon; Giljoo Nam; Hanyoung Jang; HyeongYeop Kang

arXiv:2604.05525 · cs.GR · April 8, 2026

TL;DR

CrowdVLA introduces a novel crowd simulation approach where agents interpret scene semantics and social norms through vision and language, enabling more meaningful and context-aware pedestrian behaviors.

Contribution

The paper presents CrowdVLA, a new framework that models pedestrians as vision-language-action agents capable of consequence-aware decision making in crowd simulation.

Findings

01. Agents interpret scene semantics and social norms from visual observations.

02. CrowdVLA enables consequence-aware decision making through simulation rollouts.

03. The approach shifts crowd simulation from motion-centric to perception-driven behaviors.
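
To make "consequence-aware decision making through simulation rollouts" concrete, here is a minimal sketch of rollout-based action selection in general. It is not the paper's method: the function names (`rollout_cost`, `select_action`), the constant-velocity neighbor model, the collision radius, and the penalty weight are all assumptions chosen for illustration. Each candidate velocity is simulated forward for a short horizon, and the agent picks the one with the lowest predicted cost.

```python
import math

def rollout_cost(pos, vel, goal, neighbors, horizon=10, dt=0.1, radius=0.5):
    """Simulate one candidate velocity forward and return a scalar cost.

    Hypothetical cost: a fixed penalty per predicted collision step plus
    the remaining distance to the goal at the end of the rollout.
    """
    x, y = pos
    cost = 0.0
    for step in range(1, horizon + 1):
        x += vel[0] * dt
        y += vel[1] * dt
        t = step * dt
        for (nx, ny), (nvx, nvy) in neighbors:
            # Assumption: neighbors keep a constant velocity during the rollout.
            d = math.hypot(x - (nx + nvx * t), y - (ny + nvy * t))
            if d < radius:
                cost += 100.0  # illustrative penalty for a predicted collision
    cost += math.hypot(goal[0] - x, goal[1] - y)
    return cost

def select_action(pos, goal, neighbors, candidates):
    """Return the candidate velocity whose rollout has the lowest cost."""
    return min(candidates, key=lambda v: rollout_cost(pos, v, goal, neighbors))

if __name__ == "__main__":
    candidates = [(1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
    blocked = [((1.0, 0.0), (0.0, 0.0))]  # a stationary pedestrian ahead
    print(select_action((0.0, 0.0), (5.0, 0.0), blocked, candidates))
```

In this toy setting the agent walks straight toward the goal when the path is clear and sidesteps when the rollout predicts a collision. A perception-driven system such as the one described here would presumably derive candidates and costs from scene semantics and language rather than from geometry alone.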

Abstract

Crowds do not merely move; they decide. Human navigation is inherently contextual: people interpret the meaning of space, social norms, and potential consequences before acting. Sidewalks invite walking, crosswalks invite crossing, and deviations are weighed against urgency and safety. Yet most crowd simulation methods reduce navigation to geometry and collision avoidance, producing motion that is plausible but rarely intentional. We introduce CrowdVLA, a new formulation of crowd simulation that models each pedestrian as a Vision-Language-Action (VLA) agent. Instead of replaying recorded trajectories, CrowdVLA enables agents to interpret scene semantics and social norms from visual observations and language instructions, and to select actions through consequence-aware reasoning. CrowdVLA addresses three key challenges: limited agent-centric supervision in crowd datasets, unstable…
