PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Teng Hu; Zhentao Yu; Zhengguang Zhou; Jiangning Zhang; Yuan Zhou; Qinglin Lu; Ran Yi

arXiv:2506.07848 · cs.CV · June 10, 2025

Open Access

TL;DR

PolyVivid is a multi-subject video generation framework that uses cross-modal interaction and enhancement modules to improve identity consistency, visual realism, and subject control in generated videos.

Contribution

The paper presents a multi-subject video customization framework built from three modules: a VLLM-based text-image fusion module, a 3D-RoPE-based enhancement module, and an attention-inherited identity injection module.

Findings

1. Outperforms existing methods in identity fidelity and realism.
2. Effectively reduces subject ambiguity and drift.
3. Enhances multi-subject interaction and control.

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity…
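The abstract names a 3D-RoPE-based enhancement module but gives no implementation details here. As background only, the following is a minimal sketch of generic 3D rotary position embedding (RoPE), in which a token's feature vector is split into three chunks and each chunk is rotated according to one coordinate of a (t, h, w) position. The function names `rope_1d`, `rope_3d`, and `dot`, the equal three-way split, and the base of 10000 are illustrative assumptions, not PolyVivid's actual design; how the paper assigns coordinates to text and image tokens is not stated in this summary.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "feature size must be even"
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # one rotation frequency per feature pair
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def rope_3d(vec, coord, base=10000.0):
    """Apply 1D RoPE to three equal chunks of `vec`, one per (t, h, w) axis.

    Assumption for illustration: the feature dimension is split evenly
    across the three axes.
    """
    d = len(vec)
    assert d % 6 == 0, "feature size must split into three even chunks"
    chunk = d // 3
    out = []
    for axis, pos in enumerate(coord):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos, base)
    return out

def dot(a, b):
    """Plain inner product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key features (12 dims) and (t, h, w) positions.
q = [0.3, -1.2, 0.5, 0.9, -0.4, 1.1, 0.2, -0.7, 1.5, 0.1, -0.9, 0.6]
k = [1.0, 0.4, -0.8, 0.3, 0.7, -0.2, -1.1, 0.5, 0.2, 0.9, 0.4, -0.6]
score = dot(rope_3d(q, (1, 2, 3)), rope_3d(k, (4, 0, 5)))
```

The property that makes RoPE useful for attention is that the score between a rotated query and a rotated key depends only on the offset between their positions along each axis, so shifting both tokens by the same (t, h, w) amount leaves `score` unchanged.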

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis