GenHSI: Controllable Generation of Human-Scene Interaction Videos

Zekun Li; Rui Zhou; Rahul Sajnani; Xiaoyan Cong; Daniel Ritchie; Srinath Sridhar

arXiv:2506.19840·cs.CV·April 20, 2026


TL;DR

GenHSI is a training-free, three-stage method that generates long, controllable, 3D-aware human-scene interaction (HSI) videos from an image of a scene and a character with a user description, preserving subject identity and keeping interactions plausible.

Contribution

GenHSI introduces a training-free approach to long HSI video synthesis that splits generation into script writing, pre-visualization, and animation stages and leverages 2D inpainting and 3D optimization.

Findings

01 Generates long HSI videos with preserved identity.

02 Produces plausible human interactions and dynamics in 3D-aware videos.

03 Operates without training, using a three-stage pipeline.

Abstract

Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human-scene interactions (HSI), including unrealistic dynamics and affordance, lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide the video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to generate long videos that preserve human identity and provide rich and plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple…
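
The three-stage decomposition described in the abstract can be pictured as a simple chain of stages that pass a shared state along. The sketch below is illustrative only and is not the authors' implementation: every function name, the dictionary keys, and the string placeholders are assumptions made for this example, and each stage body is a stub, since the abstract does not specify what each stage computes internally.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical shared state passed between stages (not from the paper).
State = Dict[str, Any]
Stage = Callable[[State], State]


def script_writing(state: State) -> State:
    # Stub: the real stage rewrites a complex prompt; here the prompt is
    # wrapped unchanged as a one-element "script".
    return {**state, "script": [state["description"]]}


def pre_visualization(state: State) -> State:
    # Stub: stands in for whatever intermediate result this stage produces.
    return {**state, "previs": [f"previs for: {step}" for step in state["script"]]}


def animation(state: State) -> State:
    # Stub: stands in for video synthesis; emits one placeholder per previs item.
    return {**state, "video": [f"clip from: {item}" for item in state["previs"]]}


# Stage order follows the abstract: (1) script writing, (2) pre-visualization,
# (3) animation.
STAGES: List[Tuple[str, Stage]] = [
    ("script writing", script_writing),
    ("pre-visualization", pre_visualization),
    ("animation", animation),
]


def run_pipeline(scene_image: str, character: str, description: str) -> State:
    """Run the stub stages in order and return the final state."""
    state: State = {
        "scene_image": scene_image,
        "character": character,
        "description": description,
    }
    for _name, stage in STAGES:
        state = stage(state)
    return state


if __name__ == "__main__":
    result = run_pipeline("scene.png", "character.png", "walk to the sofa and sit down")
    print(result["video"])
```

Keeping each stage as a function from state to state makes the ordering explicit and lets a stub be replaced by a real component without touching the others; this is a design choice of the sketch, not a claim about how GenHSI is structured internally.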
