OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng

TL;DR
OmniShow is an end-to-end framework that synthesizes high-quality human-object interaction videos conditioned on text, reference images, audio, and pose, advancing the state of the art in multimodal video generation.
Contribution
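The paper introduces OmniShow, a multimodal video generation framework that combines Unified Channel-wise Conditioning (for image and pose injection) and Gated Local-Context Attention (for audio-visual synchronization) with a multi-stage training strategy. It also contributes HOIVG-Bench, a comprehensive benchmark for human-object interaction video generation.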
Findings
OmniShow achieves state-of-the-art performance across various multimodal conditions.
The framework effectively harmonizes text, images, audio, and pose for video synthesis.
HOIVG-Bench provides a new standard for evaluating human-object interaction video generation.
Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To…
