ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Zihao Huang; Tianqi Liu; Zhaoxi Chen; Shaocong Xu; Saining Zhang; Lixing Xiao; Zhiguo Cao; Wei Li; Hao Zhao; Ziwei Liu

arXiv:2603.04338·cs.CV·March 5, 2026

TL;DR

ArtHOI is a zero-shot framework that synthesizes articulated human-object interactions by reconstructing 4D scenes from monocular videos generated by a diffusion model, yielding physically plausible and geometrically consistent interactions without 3D supervision.

Contribution

This work is the first to formulate articulated HOI synthesis as a 4D reconstruction problem from monocular videos, integrating flow-based segmentation and a decoupled optimization pipeline.

Findings

01. Outperforms prior methods in contact accuracy
02. Reduces penetration issues in synthesized interactions
03. Enhances articulation fidelity in diverse scenes

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for…
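To make the "2D video as supervision for an inverse rendering problem" idea concrete, here is a toy sketch in Python. It is not ArtHOI's pipeline: it assumes a single hinged panel, a pinhole camera with known parameters, and perfect 2D point tracks, and it uses a simple ternary search in place of whatever optimizer a real system would use. All names (`DOOR_POINTS`, `project`, `fit_angle`, `reconstruct`) and all constants are hypothetical and chosen only for illustration.

```python
import math

FOCAL = 1.0   # assumed normalized focal length
DEPTH = 3.0   # assumed distance from the camera to the hinge axis

# Hypothetical rest-pose points on a door panel: (distance from hinge, height).
DOOR_POINTS = [(0.3, 0.2), (0.6, 0.5), (0.9, 0.8)]


def project(theta):
    """Rotate the panel by theta about the vertical hinge and project to 2D."""
    pixels = []
    for radius, height in DOOR_POINTS:
        x = radius * math.cos(theta)
        z = DEPTH + radius * math.sin(theta)  # panel swings away from the camera
        pixels.append((FOCAL * x / z, FOCAL * height / z))
    return pixels


def reprojection_loss(theta, observed):
    """Sum of squared 2D distances between rendered and observed points."""
    return sum((u - uo) ** 2 + (v - vo) ** 2
               for (u, v), (uo, vo) in zip(project(theta), observed))


def fit_angle(observed, lo=0.0, hi=math.pi / 2, iters=60):
    """Ternary search for the hinge angle.

    In this toy setup every projected coordinate is monotone in theta on
    [0, pi/2], so the loss has a single minimum inside the bracket.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if reprojection_loss(m1, observed) < reprojection_loss(m2, observed):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def reconstruct(video_tracks):
    """Recover one hinge angle per frame from 2D tracks alone."""
    return [fit_angle(frame) for frame in video_tracks]


# Synthetic "video": the door opens from 0.1 to 1.0 rad over five frames.
true_angles = [0.1, 0.3, 0.55, 0.8, 1.0]
video_tracks = [project(t) for t in true_angles]
estimated_angles = reconstruct(video_tracks)
```

The point of the sketch is only the structure: a parametric articulated model is rendered to 2D, compared against the video, and its parameters are adjusted until the two agree, with no 3D ground truth involved. A real system would additionally have to estimate geometry, camera, human pose, and contact, none of which this example attempts.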

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Human Motion and Animation