ViHOI: Human-Object Interaction Synthesis with Visual Priors
Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding

TL;DR
ViHOI introduces a diffusion-based framework that leverages visual priors extracted from 2D images using large vision-language models to generate realistic 3D human-object interactions with improved generalization.
Contribution
The paper combines a large vision-language model with a diffusion-based generator so that interaction priors extracted from 2D images guide 3D HOI synthesis, with the aim of improving physical plausibility and generalization.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Demonstrates superior generalization to unseen objects and interactions.
Effectively leverages visual priors for realistic motion generation.
Abstract
Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model.…
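The abstract's description of compressing high-dimensional VLM features into compact prior tokens can be illustrated with a minimal NumPy sketch of the core Q-Former idea: a small set of learnable query vectors cross-attends over a long feature sequence and returns one output token per query. This is only an illustration under stated assumptions, not the authors' adapter. A real Q-Former stacks several multi-head layers with self-attention and feed-forward blocks, and every name and dimension below (`compress_to_prior_tokens`, 256 feature tokens of width 1024, 8 prior tokens of width 64) is hypothetical; the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_to_prior_tokens(features, queries, w_q, w_k, w_v):
    """Single-head cross-attention from learnable queries to VLM features.

    features: (n_tokens, d_vlm) hidden states from a VLM layer (assumed shape)
    queries:  (n_prior, d_model) learnable query vectors
    w_q:      (d_model, d_model) query projection
    w_k, w_v: (d_vlm, d_model) key and value projections
    returns:  (n_prior, d_model) compact prior tokens
    """
    q = queries @ w_q                         # (n_prior, d_model)
    k = features @ w_k                        # (n_tokens, d_model)
    v = features @ w_v                        # (n_tokens, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_prior, n_tokens)
    attn = softmax(scores, axis=-1)           # each query's weights sum to 1
    return queries + attn @ v                 # residual connection


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_vlm, n_prior, d_model = 256, 1024, 8, 64

features = rng.standard_normal((n_tokens, d_vlm))        # stand-in for VLM features
queries = 0.02 * rng.standard_normal((n_prior, d_model))
w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_vlm, d_model)) / np.sqrt(d_vlm)
w_v = rng.standard_normal((d_vlm, d_model)) / np.sqrt(d_vlm)

prior_tokens = compress_to_prior_tokens(features, queries, w_q, w_k, w_v)
print(prior_tokens.shape)
```

The output length depends only on the number of queries, not on how many feature tokens the VLM produces, so the downstream diffusion model would see a fixed, small conditioning sequence regardless of image resolution or prompt length.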
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis
