TL;DR
VHOI is a two-stage framework for realistic, controllable HOI video generation: it first densifies sparse human-object interaction trajectories into HOI mask sequences, then fine-tunes a video diffusion model conditioned on those masks, using a novel HOI-aware motion representation.
Contribution
VHOI presents a novel HOI-aware motion encoding and a two-stage densification-and-generation process for controllable HOI video synthesis.
Findings
Achieves state-of-the-art results in controllable HOI video generation.
Can generate full human navigation sequences leading to object interactions.
Demonstrates effectiveness of dense mask conditioning for realistic motion synthesis.
Abstract
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior…
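The idea of a color-encoded HOI mask can be illustrated with a minimal sketch. It assumes each frame is already available as a grid of integer labels. The label ids, the body-part split, and the colors in `PALETTE` are hypothetical placeholders, not the encoding used by VHOI, and the helper names `colorize_mask` and `colorize_sequence` are invented for this example.

```python
# Hypothetical label ids and colors, chosen only for illustration;
# the actual palette and body-part partition of VHOI are not given here.
PALETTE = {
    0: (0, 0, 0),      # background
    1: (255, 0, 0),    # object
    2: (0, 255, 0),    # human torso
    3: (0, 0, 255),    # human left hand
    4: (255, 255, 0),  # human right hand
}


def colorize_mask(label_map):
    """Map a 2D grid of integer labels to a 2D grid of RGB tuples."""
    return [[PALETTE[label] for label in row] for row in label_map]


def colorize_sequence(label_maps):
    """Apply the same palette to every frame so colors stay consistent over time."""
    return [colorize_mask(frame) for frame in label_maps]


# Two tiny 2x2 frames: the right hand (4) moves onto the object (1).
frames = [
    [[0, 1], [3, 4]],
    [[0, 4], [3, 0]],
]
dense_condition = colorize_sequence(frames)
```

Because the palette is fixed across frames, a given color always denotes the same entity or body part, which is what lets a conditioning signal of this kind separate human motion from object motion.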
