HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

Mingjin Chen; Junhao Chen; Zhaoxin Fan; Yujian Lee; Zichen Dang; Lili Wang; Yawen Cui; Lap-Pui Chau; Yi Wang

arXiv:2604.03305 · cs.CV · April 7, 2026

TL;DR

HVG-3D introduces a 3D-aware diffusion framework for hand-object interaction video synthesis, enabling explicit 3D reasoning and precise control using real or simulated 3D data.

Contribution

The paper presents a diffusion-based architecture augmented with a 3D ControlNet, together with a hybrid pipeline, for flexible, high-quality 3D-conditioned video synthesis.

Findings

01. Achieves state-of-the-art spatial fidelity and temporal coherence.
02. Enables effective use of both real and simulated 3D data.
03. Provides precise spatial and temporal control during video generation.

Abstract

Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, we develop a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and…
