Populate-A-Scene: Affordance-Aware Human Video Generation
Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang

TL;DR
This paper introduces Populate-A-Scene, a method that fine-tunes text-to-video models to generate human actions in scenes based on affordance perception, enabling interactive scene understanding without explicit annotations.
Contribution
It demonstrates that pre-trained video models can infer human-environment affordances from a single scene image without explicit labels, advancing interactive scene generation.
Findings
The model can insert humans into scenes with coherent behavior.
Affordance inference is achieved without labeled datasets.
Cross-attention heatmaps reveal inherent affordance perception.
Abstract
Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
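To make the cross-attention heatmap idea concrete, here is a minimal sketch of how a heatmap for one prompt token can be read out of a single attention layer. It is not the authors' implementation: it assumes a single head and a single frame, and the function name `cross_attention_heatmap`, its parameters, and the toy tensors are all hypothetical and chosen for illustration only.

```python
import numpy as np

def cross_attention_heatmap(queries, keys, token_index, grid_hw):
    """Return a [0, 1] heatmap showing where one text token is attended to.

    queries:     (H*W, d) visual tokens for one frame (hypothetical layout)
    keys:        (T, d) text-token embeddings from the prompt
    token_index: index of the word of interest, e.g. an action verb
    grid_hw:     (H, W) spatial grid that the visual tokens were flattened from
    """
    h, w = grid_hw
    d = queries.shape[-1]
    # Scaled dot-product scores between every spatial location and every word.
    scores = queries @ keys.T / np.sqrt(d)            # (H*W, T)
    # Numerically stable softmax over the text tokens.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Attention mass each location assigns to the chosen word.
    heat = weights[:, token_index].reshape(h, w)
    # Min-max normalise so the map can be overlaid on the scene image.
    span = heat.max() - heat.min()
    return (heat - heat.min()) / span if span > 0 else np.zeros_like(heat)

# Toy example: one location (flat index 5) is aligned with text token 1.
toy_queries = np.zeros((12, 8))
toy_queries[5, 0] = 10.0
toy_keys = np.zeros((3, 8))
toy_keys[1, 0] = 1.0
toy_heat = cross_attention_heatmap(toy_queries, toy_keys, token_index=1, grid_hw=(3, 4))
```

In a real model, a map like this would typically be averaged over heads, layers, and denoising steps before being inspected; those choices are not specified in the text above.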
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper tackles an interesting and relevant problem of affordance-aware human-scene video generation.
- The paper provides a straightforward extension to condition a pretrained text-to-video model on scene images.
- Qualitative examples are visually appealing and demonstrate some degree of human-scene interaction.
- The paper mainly fine-tunes an existing model (e.g., MovieGen) with additional scene conditioning. The architectural modifications (latent concatenation + text-image fusion) appear incremental.
- The affordance aspect, which the paper claims as an essential contribution, seems more like a re-interpretation of what attention maps already provide, rather than a fundamentally new capability.
- The use of cross-attention heatmaps as evidence of affordance perception is weak. These maps do not prov
- Revealing Latent Capabilities of T2V Models: The paper provides a valuable scientific insight by demonstrating that large, pre-trained text-to-video models implicitly learn about affordances. The analysis of cross-attention maps (Fig. 4) convincingly shows that the model can associate action words (e.g., "riding," "holding") with the correct interactable regions in a scene (e.g., a horse, reins), even without being explicitly trained on affordance-labeled data.
- Scalable and Automated Data
- Motivation: The paper's motivation is not sufficiently compelling. While it frames the work as creating a "simulator," it lacks any explicit design for modeling physical dynamics. This terminology seems to overstate the model's capabilities, which are more focused on semantic plausibility than physical simulation. Furthermore, the core problem of generating human-scene interactions from an empty scene is already handled well by several state-of-the-art foundation models, which can produce ph
Clarity: The paper is well-written. The task, method, and contributions are communicated with great clarity. Figure 1 provides an immediate and impressive overview of the model's capabilities. Figure 2 (data pipeline) and Figure 5 (affordance analysis) are both highly effective at explaining the core technical ideas and a key finding.
Data Pipeline Biases: The clever data generation pipeline is also a potential source of weakness. Inpainting Artifacts: The model is trained to place a human into a
Should this really be a video synthesis paper? I looked at the videos in the results, and many of them show barely any human motion. For example, one video shows a person standing next to a car while the camera moves. Is this really useful, and should it be claimed as video synthesis? Perhaps a more palatable claim would have been to show that the model generates images with correct affordances (correct placement of humans, etc.), because that is what it appears to do in most cases. Also
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Face recognition and analysis
