OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou; Guisheng Liu; Hao Yang; Jiatong Li; Jingyu Lin; Xiaohu Huang; Yichen Liu; Xin Gao; Cunjian Chen; Shilei Wen; Chi-Wing Fu; Pheng-Ann Heng

arXiv:2604.11804·cs.CV·April 20, 2026

TL;DR

OmniShow is an end-to-end framework that synthesizes high-quality human-object interaction videos conditioned jointly on text, reference images, audio, and pose.

Contribution

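The paper introduces OmniShow, a multimodal video generation framework built on two conditioning techniques (Unified Channel-wise Conditioning for image and pose injection, and Gated Local-Context Attention for audio-visual synchronization) and a multi-stage training strategy. It also introduces HOIVG-Bench, a benchmark for human-object interaction video generation.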

Findings

01 OmniShow achieves state-of-the-art performance across various multimodal conditions.

02 The framework effectively harmonizes text, images, audio, and pose for video synthesis.

03 The HOIVG-Bench provides a new standard for evaluating human-object interaction video generation.

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To…

Code & Models

Repositories

correr-zhou/OmniShow
github

Datasets

donghao-zhou/HOIVG-Bench
dataset· 249 dl
