GenHSI: Controllable Generation of Human-Scene Interaction Videos
Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie, Srinath Sridhar

TL;DR
GenHSI is a training-free, three-stage method that generates long, controllable, 3D-aware human-scene interaction (HSI) videos from an image of a scene and a character plus a user description, preserving subject identity and keeping the interactions plausible.
Contribution
GenHSI introduces a training-free approach to long HSI video synthesis built on three stages (script writing, pre-visualization, and animation) that draw on 2D inpainting and 3D optimization.
Findings
Generates long HSI videos that preserve subject identity.
Produces plausible human interactions and dynamics in 3D-aware videos.
Operates without training, using a three-stage pipeline.
Abstract
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human-scene interactions (HSI), including unrealistic dynamics and affordance, lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide the video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to generate long videos that preserve human identity and provide rich and plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple…
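As a rough illustration of the three-stage flow named in the abstract, the control flow might be organized as in the sketch below. This is not the authors' implementation: the names `write_script`, `generate`, and `Clip` are hypothetical, the keyword-based prompt splitting is a naive stand-in for whatever the script-writing stage actually does, and the pre-visualization and animation models are left as caller-supplied functions. Feeding the last frame of one clip into the next step is also an assumption made here for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    action: str        # one simple instruction from the script
    keyframe: str      # output of the pre-visualization stage (placeholder type)
    frames: List[str]  # output of the animation stage (placeholder type)


def write_script(prompt: str) -> List[str]:
    """Naive stand-in for script writing: split a chained prompt into simple actions.

    Assumption: actions are separated by "then" / "and then" or by semicolons.
    """
    pattern = r"\s*(?:,|;)?\s*(?:and\s+)?\bthen\b\s+|\s*;\s*"
    parts = re.split(pattern, prompt.strip(), flags=re.IGNORECASE)
    return [p.strip(" .") for p in parts if p.strip(" .")]


def generate(
    prompt: str,
    scene: str,
    previsualize: Callable[[str, str], str],
    animate: Callable[[str, str, str], List[str]],
) -> List[Clip]:
    """Run script writing, then pre-visualization and animation per action."""
    clips: List[Clip] = []
    current = scene
    for action in write_script(prompt):
        keyframe = previsualize(current, action)      # stage 2 (hypothetical callable)
        frames = animate(current, keyframe, action)   # stage 3 (hypothetical callable)
        clips.append(Clip(action, keyframe, frames))
        # Assumption: the next action starts from where this clip ended.
        current = frames[-1] if frames else keyframe
    return clips
```

Because the two model stages are passed in as plain callables, the loop can be exercised with dummy functions before any real inpainting or video model is attached.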
