ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

Panwang Pan; Jingjing Zhao; Yuchen Lin; Chenguo Lin; Chenxin Li; Hengyu Liu; Tingting Shen; Yadong MU

arXiv:2511.00511·cs.CV·March 16, 2026

Open Access

TL;DR

ID-Crafter introduces a novel framework for multi-subject video generation that combines hierarchical attention, vision-language understanding, and reinforcement learning to improve identity preservation and semantic coherence in generated videos.

Contribution

The paper presents a new multi-subject video generation method integrating hierarchical attention, pretrained vision-language models, and online reinforcement learning, along with a new dataset for training and evaluation.

Findings

01. Achieves state-of-the-art results in identity preservation and semantic coherence.

02. Demonstrates superior temporal consistency and video quality.

03. Outperforms existing methods on multiple benchmarks.

Abstract

Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement…
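
To make component (i) more concrete, here is a minimal sketch of a three-stage aggregation (intra-subject, then inter-subject, then cross-modal). It is an illustration under stated assumptions, not the authors' implementation. The function names `attend` and `hierarchical_aggregate`, the use of plain scaled dot-product attention without learned projections, and the choice of text tokens as the queries in the final stage are all hypothetical; the abstract does not specify these details.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    # Scaled dot-product attention where keys double as values.
    # Assumption: no learned projections, purely for illustration.
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def hierarchical_aggregate(subjects, text_tokens):
    # Stage 1 (intra-subject): tokens of one subject attend to each other.
    intra = [attend(tokens, tokens) for tokens in subjects]
    # Stage 2 (inter-subject): tokens from all subjects attend jointly.
    pooled = [tok for subj in intra for tok in subj]
    inter = attend(pooled, pooled)
    # Stage 3 (cross-modal): text tokens query the fused subject tokens.
    # Assumption: which modality supplies the queries is not stated in the abstract.
    return attend(text_tokens, inter)

# Toy inputs: two subjects (2 tokens and 1 token), one text token, dimension 2.
subjects = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]]
text_tokens = [[0.5, 0.5]]
fused = hierarchical_aggregate(subjects, text_tokens)
```

In this toy setup `fused` holds one vector per text token, each a convex combination of the subject tokens. A real model would add learned query, key, and value projections and operate on video latents.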

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis