Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang

TL;DR
Kaleido is a multi-subject video generation framework that improves subject consistency and reference fidelity through a new data construction pipeline and Reference Rotary Positional Encoding (R-RoPE), advancing the state of the art in subject-to-video synthesis.
Contribution
The paper introduces a new data construction pipeline and the R-RoPE mechanism, which together significantly enhance multi-subject consistency and fidelity in reference-based video generation.
Findings
Outperforms previous methods on multiple benchmarks
Achieves higher subject consistency and background disentanglement
Demonstrates improved generalization across diverse subjects
Abstract
We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a…
Peer Reviews
Decision: Submitted to ICLR 2026
1. Introduces a pipeline that enhances subject and scene diversity, improves overall data fidelity, and ensures clear separation of subjects from irrelevant components.
2. Proposes a reference-based position encoding that emphasizes the references, leading to better results.
1. The paper uses CLIP as an evaluation metric. However, CLIP is not fine-grained enough to measure subject consistency. I suggest using face recognition metrics for human faces.
1. The proposed large-scale, cross-paired data construction process is well designed and will be valuable for the community.
2. Comprehensive experiments: the evaluation covers humans, objects, and multi-subject settings, with both quantitative results and user studies.
1. The architectural novelty is limited. The model mainly relies on simple concatenation for conditioning; R-RoPE, while useful, is a modest modification. Moreover, its design is mostly empirical, without deeper analysis.
2. Validation of the proposed dataset is missing. The paper lacks quantitative evidence for dataset diversity and annotation accuracy, as well as a comparison with previous datasets.
- The paper is well written and easy to understand.
- The proposed data collection pipeline takes cross-paired images into account, which can solve the background leakage problem during training.
- The proposed R-RoPE is a simple but effective way to disentangle the denoised image from the condition.
- The model achieves SOTA results, and it is fully open-sourced, which benefits the community.
- The proposed framework concatenates along the token dimension rather than the channel dimension, which may make inference slow.
- The paper does not discuss why channel-wise concatenation, which is efficient and widely adopted, was not used.
- For R-RoPE, why is the t-dim of RoPE for reference images not shifted by T?
- The paper lacks novelty and is mostly engineering work, but that should be fine.
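The efficiency concern raised in the review above can be made concrete with a rough back-of-the-envelope sketch. It assumes full self-attention whose cost grows with the square of the sequence length. All token counts are hypothetical values chosen for illustration and are not taken from the paper, and the helper names are likewise invented for this example.

```python
def attention_cost(num_tokens: int) -> int:
    """Pairwise token interactions in full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


def token_concat_length(video_tokens: int, ref_tokens_per_image: int, num_refs: int) -> int:
    """Token-wise concatenation appends every reference token to the sequence."""
    return video_tokens + ref_tokens_per_image * num_refs


def channel_concat_length(video_tokens: int) -> int:
    """Channel-wise concatenation widens each token; sequence length is unchanged."""
    return video_tokens


# Hypothetical sizes for illustration only; not values from the paper.
VIDEO_TOKENS = 10_000
REF_TOKENS_PER_IMAGE = 1_000
NUM_REFS = 3

token_len = token_concat_length(VIDEO_TOKENS, REF_TOKENS_PER_IMAGE, NUM_REFS)
channel_len = channel_concat_length(VIDEO_TOKENS)
overhead = attention_cost(token_len) / attention_cost(channel_len)

print(f"token-wise: {token_len} tokens, channel-wise: {channel_len} tokens")
print(f"relative attention cost: {overhead:.2f}x")  # prints 1.69x for these sizes
```

Under these assumed sizes, three reference images add 30% more tokens and roughly 69% more attention work. In general, channel-wise concatenation avoids that growth but requires the references to share the video latent's spatial layout, which is one possible reason a multi-reference model might prefer token-wise concatenation; the source does not state the authors' rationale.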
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Vision and Imaging
