MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

Jinguang Tong; Jinbo Wu; Kaisiyuan Wang; Zhelun Shen; Xuan Huang; Mochu Xiang; Xuesong Li; Yingying Li; Haocheng Feng; Chen Zhao; Hang Zhou; Wei He; Chuong Nguyen; Jingdong Wang; Hongdong Li

arXiv:2603.14686 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces MVHOI, a two-stage framework that leverages a 3D foundation model to enable realistic, complex human-object interaction video reenactment with multi-view references, surpassing prior methods in handling 3D manipulations.

Contribution

The paper proposes a novel two-stage HOI video reenactment framework that integrates multi-view conditions with a 3D foundation model for improved realism and complex object manipulation.

Findings

01. Outperforms prior approaches in complex 3D object manipulations

02. Generates long-duration HOI videos with high fidelity

03. Ensures appearance consistency across views

Abstract

Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase,…
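The abstract does not say how the retrieval mechanism chooses among the multi-view reference images. As a rough illustration only, and not the authors' method, one plausible reading is that the reference whose viewing direction best matches the object's current orientation is selected. The sketch below assumes each reference image comes with a known 3D viewing direction; the function names `normalize` and `retrieve_reference` are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length direction")
    return tuple(x / n for x in v)


def retrieve_reference(target_dir, reference_dirs):
    """Pick the reference view closest to the target viewing direction.

    Assumption (not from the paper): every reference image is tagged with
    a viewing direction, and closeness is measured by cosine similarity.
    Returns (index of best reference, cosine similarity).
    """
    if not reference_dirs:
        raise ValueError("no reference views given")
    t = normalize(target_dir)
    best_idx, best_sim = -1, -2.0  # cosine similarity is always >= -1
    for i, d in enumerate(reference_dirs):
        r = normalize(d)
        sim = sum(a * b for a, b in zip(t, r))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx, best_sim


# Four hypothetical reference views: front, right, back, left.
refs = [(0, 0, 1), (1, 0, 0), (0, 0, -1), (-1, 0, 0)]
idx, sim = retrieve_reference((0.9, 0.0, 0.1), refs)
print(idx)  # index of the right-facing view
```

A real system would more likely retrieve in a learned feature space or blend several nearby views; nearest-direction lookup is used here only because it is the simplest instance of view-conditioned retrieval.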

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Multimodal Machine Learning Applications