LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Jiazheng Xing; Fei Du; Hangjie Yuan; Pengwei Liu; Hongbin Xu; Hai Ci; Ruigang Niu; Weihua Chen; Fan Wang; Yong Liu

arXiv:2603.20192 · cs.CV · March 23, 2026

TL;DR

LumosX pairs a face-attribute-aware data collection pipeline with relational attention modules to improve identity consistency and attribute alignment in personalized multi-subject video generation with diffusion models.

Contribution

The paper proposes a new data pipeline and relational attention modules that enhance face-attribute alignment and intra-group consistency in personalized videos.

Findings

01. Achieves state-of-the-art results in personalized multi-subject video generation.

02. Demonstrates improved identity consistency and semantic alignment.

03. Provides a new benchmark for evaluating personalized video generation methods.

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and…
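As a rough illustration of what an explicit intra-group constraint could look like, the sketch below builds a boolean attention mask in which each face token and its attribute tokens may attend only to tokens of the same subject, plus any shared tokens. This is not the LumosX implementation: the function name `group_attention_mask`, the `group_ids` layout, and the `-1` convention for shared tokens are all assumptions made for the example.

```python
from typing import List


def group_attention_mask(group_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] = True when token i may attend to token j.

    group_ids[k] is the (hypothetical) subject group of token k.
    A value of -1 marks a shared token (for example background or
    global text) that is visible to, and can see, every other token.
    """
    n = len(group_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            shared = group_ids[i] == -1 or group_ids[j] == -1
            same_group = group_ids[i] == group_ids[j]
            row.append(shared or same_group)
        mask.append(row)
    return mask


# Assumed layout: tokens 0-1 belong to subject 0 (face + attribute),
# tokens 2-3 belong to subject 1, and token 4 is shared.
group_ids = [0, 0, 1, 1, -1]
mask = group_attention_mask(group_ids)
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so that one subject's attribute tokens cannot influence another subject's face tokens.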

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper addresses an important and challenging problem in video generation: enabling both personalization and semantic control, areas where previous works have struggled.
- The paper introduces a dataset for this task along with model-side designs.

Weaknesses

- The focus on generating videos based solely on facial attributes may limit the method’s generalizability to broader contexts.
- It is unclear how the approach performs when handling videos with multiple subjects (three or more), as the datasets support up to three subjects.
- The implementation details lack specifics on computational requirements such as GPU count and training duration.
- The definition of a "subject" during data curation is ambiguous. Figures 3 and 4 suggest that a subject ma

Reviewer 02 · Rating 6 · Confidence 1

Strengths

1. Combines MLLM-driven data annotation with relational attention to solve a misalignment problem, an innovative fusion of NLP (entity extraction) and computer vision (positional embedding/attention masking) techniques.
2. The problem statement (face-attribute misalignment) is clearly articulated with examples (e.g., "A man on the left... and a man on the right..." causing confusion).
3. Enables flexible multi-subject customization (foreground/background control) that prior models lack.

Weaknesses

1. The training dataset includes only 1–3 subjects, and the paper acknowledges instability for 10+ subjects due to RoPE extrapolation. No preliminary results or mitigation details are provided beyond a future-work note.
2. All evaluations rely on automated metrics. Human judgment of face-attribute alignment, video naturalness, and prompt adherence would strengthen the claims (e.g., do viewers perceive fewer misassignments in LumosX-generated videos?).

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- This paper focuses on the problem of identity-attribute consistency, which sounds interesting but is already a well-explored topic in multi-subject video personalization.
- In Tables 1–3, LumosX shows strong quantitative results compared with SOTA models such as Phantom and SkyReels.
- The paper is well written and easy to follow. The figures are well plotted and informative, helping readers quickly understand the core ideas.

Weaknesses

- Although the paper focuses on maintaining consistent identity-attribute pairing, the authors show only one generated example with multiple identity-attribute pairs (the right example in Figure 5). A single sample is not enough to support the claim that LumosX solves the issue of inconsistent identity-attribute pairing.
- While interesting, it is unclear whether the issue of inconsistent identity-attribute pairing is an actual problem. First, in the only multi-subject example provi

Code & Models

Models

🤗
Alibaba-DAMO-Academy/LumosX
model· 83 dl· ♡ 24

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis