ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai; Linjie Zhong; Ling Guo; Changxing Ding

arXiv:2603.24383·cs.CV·March 26, 2026

TL;DR

ViHOI introduces a diffusion-based framework that leverages visual priors extracted from 2D images using large vision-language models to generate realistic 3D human-object interactions with improved generalization.

Contribution

The paper combines a large vision-language model with a diffusion model so that priors extracted from 2D images guide 3D HOI synthesis, targeting physical plausibility and generalization.

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Demonstrates superior generalization to unseen objects and interactions.

03 Effectively leverages visual priors for realistic motion generation.

Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model.…
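
The abstract describes a Q-Former-based adapter that compresses high-dimensional VLM features into a small set of prior tokens. The sketch below illustrates only the general idea behind such adapters: a fixed number of learnable query vectors cross-attend over a long feature sequence, so the output length depends on the number of queries rather than on the input length. It is not the paper's implementation. All sizes (256 features of width 1024, 8 tokens of width 64), the single attention head, the single layer, and the random weights are assumptions chosen for illustration, and the function name compress_to_prior_tokens is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_prior_tokens(features, queries, w_q, w_k, w_v):
    """Single-head cross-attention from learnable queries to VLM features.

    features: (n_feat, d_vlm)    long sequence of VLM features (assumed shape)
    queries:  (n_tokens, d_model) learnable query vectors (random here)
    Returns the compact tokens (n_tokens, d_model) and the attention map.
    """
    q = queries @ w_q                          # (n_tokens, d_model)
    k = features @ w_k                         # (n_feat, d_model)
    v = features @ w_v                         # (n_feat, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_tokens, n_feat)
    attn = softmax(scores, axis=-1)            # each query's weights sum to 1
    return attn @ v, attn

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
n_feat, d_vlm, n_tokens, d_model = 256, 1024, 8, 64

features = rng.standard_normal((n_feat, d_vlm))
queries = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_vlm, d_model)) / np.sqrt(d_vlm)
w_v = rng.standard_normal((d_vlm, d_model)) / np.sqrt(d_vlm)

prior_tokens, attn = compress_to_prior_tokens(features, queries, w_q, w_k, w_v)
print(prior_tokens.shape)  # (8, 64)
```

In this sketch, 256 feature vectors are reduced to 8 tokens, which is the kind of compact conditioning signal a diffusion model could consume. A trained adapter would learn the queries and projection matrices and would typically stack several attention and feed-forward layers.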

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis