Dexterous World Models
Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo

TL;DR
Dexterous World Model (DWM) is a scene-action-conditioned video diffusion framework that takes a static 3D scene rendering and an egocentric hand motion sequence and generates temporally coherent videos of plausible human-scene interactions, adding embodied interactivity to otherwise static digital twins.
Contribution
Introduces DWM, a new diffusion-based approach that models how human actions induce dynamic changes in static 3D scenes, enabling interactive digital twins with embodied simulation capabilities.
Findings
DWM produces realistic, physically plausible human-scene interactions.
The framework maintains scene and camera consistency during interactions.
Experiments validate DWM's ability to generate diverse object manipulations.
Abstract
Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM,…
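The two conditioning signals named in the abstract, a static scene rendering along the camera trajectory and an egocentric hand mesh rendering, can be pictured as two frame-aligned videos. The sketch below is only an illustration: the abstract does not say how DWM feeds these signals to the diffusion model, so the channel-wise concatenation used here is an assumption (a common way to combine pixel-aligned conditions), and the function name `build_conditioning` and the toy shapes are hypothetical.

```python
import numpy as np


def build_conditioning(scene_video: np.ndarray, hand_video: np.ndarray) -> np.ndarray:
    """Stack two frame-aligned RGB videos into one conditioning tensor.

    ASSUMPTION: channel-wise concatenation is an illustrative choice,
    not a mechanism confirmed by the abstract.

    Both inputs have shape (T, H, W, 3); the result has shape (T, H, W, 6).
    """
    for name, video in (("scene_video", scene_video), ("hand_video", hand_video)):
        if video.ndim != 4 or video.shape[-1] != 3:
            raise ValueError(f"{name} must have shape (T, H, W, 3), got {video.shape}")
    if scene_video.shape != hand_video.shape:
        # Each hand frame must share the frame index and resolution
        # of the matching scene frame.
        raise ValueError(
            f"videos are not aligned: {scene_video.shape} vs {hand_video.shape}"
        )
    # Channels 0-2 hold the static scene rendering, channels 3-5 the hand rendering.
    return np.concatenate([scene_video, hand_video], axis=-1)


# Toy inputs: 4 frames of 8x8 pixels (hypothetical sizes).
scene = np.zeros((4, 8, 8, 3), dtype=np.float32)  # stand-in for scene renderings
hand = np.ones((4, 8, 8, 3), dtype=np.float32)    # stand-in for hand mesh renderings
cond = build_conditioning(scene, hand)
print(cond.shape)
```

Keeping the two renderings pixel-aligned per frame is what lets a single tensor carry both the camera-consistent scene layout and the hand geometry and motion cues; a real system might instead encode each stream into a latent space first, which this sketch does not attempt.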
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
