ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
Zhuoran Yang, Yanyong Zhang

TL;DR
ConsisDrive is a novel world model for autonomous driving video generation that maintains object identity over time using instance-level attention and loss mechanisms, improving realism and downstream task performance.
Contribution
It introduces instance-masked attention and an instance-masked loss to enforce temporal identity consistency in driving world models.
Findings
Achieves state-of-the-art quality in driving video generation.
Significantly improves downstream autonomous driving tasks.
Reduces identity drift in generated videos.
Abstract
Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes…
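The masking idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `BACKGROUND` label, and the flat token layout are all assumptions made for illustration. It only builds the boolean patterns that would tell an attention block which query/key pairs are allowed, assuming each visual token carries the ID of the instance it depicts.

```python
from typing import List

BACKGROUND = -1  # hypothetical label for visual tokens that belong to no instance


def identity_mask(visual_ids: List[int], instance_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when visual token q may attend to instance-feature token k.

    Assumption: a visual token only sees the feature token of its own instance,
    and background tokens see no instance feature at all.
    """
    return [[v != BACKGROUND and v == inst for inst in instance_ids]
            for v in visual_ids]


def trajectory_mask(ids_per_frame: List[List[int]]) -> List[List[bool]]:
    """mask[q][k] is True when visual tokens q and k share the same instance ID.

    Tokens are flattened frame by frame, so a True entry can link the same
    object across different frames (the temporal part of the constraint).
    Background tokens are allowed to attend to other background tokens.
    """
    flat = [i for frame in ids_per_frame for i in frame]
    return [[a == b for b in flat] for a in flat]


# Two frames with three visual tokens each; instances 7 and 9 are hypothetical IDs.
frames = [[7, 7, BACKGROUND], [BACKGROUND, 7, 9]]
flat_ids = [i for frame in frames for i in frame]
id_mask = identity_mask(flat_ids, [7, 9])
traj_mask = trajectory_mask(frames)
```

In a real attention layer, a pattern like this would typically be converted to an additive mask (large negative values where the entry is False) before the softmax. How the paper combines the two masks is not stated in the text above.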
Peer Reviews
Decision: ICLR 2026 Poster
- Clearly identifies and addresses “identity drift,” a critical yet understudied issue in driving video generation.
- Instance-Masked Attention effectively enforces instance-level consistency by leveraging identity and trajectory masks.
- Instance-Masked Loss adaptively balances foreground and background supervision, improving fidelity for small objects.
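To make the foreground/background balancing point concrete, here is a hedged sketch of one possible weighting scheme. It is an assumption for illustration, not the loss defined in the paper: the function name and the `fg_share` parameter are hypothetical. The error is averaged separately over foreground and background elements, so a small object keeps a fixed share of the loss regardless of how few pixels it covers.

```python
from typing import List


def instance_weighted_mse(pred: List[float], target: List[float],
                          fg_mask: List[bool], fg_share: float = 0.5) -> float:
    """Squared error with fixed foreground/background shares (illustrative only).

    Each region is normalised by its own size, so the per-element weight of
    the foreground grows as the object gets smaller.
    """
    fg = [(p - t) ** 2 for p, t, m in zip(pred, target, fg_mask) if m]
    bg = [(p - t) ** 2 for p, t, m in zip(pred, target, fg_mask) if not m]
    if not fg and not bg:
        return 0.0
    if not fg:  # no instance present: plain mean over the background
        return sum(bg) / len(bg)
    if not bg:  # everything is foreground: plain mean over the foreground
        return sum(fg) / len(fg)
    return fg_share * sum(fg) / len(fg) + (1.0 - fg_share) * sum(bg) / len(bg)


# One foreground element out of eight carries all of the error.
pred = [2.0] + [0.0] * 7
target = [0.0] * 8
mask = [True] + [False] * 7
plain = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
balanced = instance_weighted_mse(pred, target, mask)
```

In this toy case the plain mean dilutes the single foreground error across all eight elements, while the balanced version keeps half of the total weight on it.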
The paper lacks comparisons with several recent SOTA methods, particularly **InstaDrive** [1], which also focuses on the quality of instance-level generation. Including such comparisons would better contextualize the proposed method’s performance and highlight its relative strengths or limitations in generating high-fidelity instances.
[1] Yang Z, Guo X, Ding C, et al. InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation[C]//Proceedings of the IEEE/CVF I
S1. This work identifies instance identity drift (including category shifts, color inconsistencies, and foreground dilution) as a serious issue for driving-oriented synthetic data, and provides solutions tailored to the unique demands of driving scenes.
S2. The experimental results show advantages over current approaches. Evaluations on downstream tasks are included, providing the important insight that incorporating synthetic data helps improve downstream task performance. Also, ablations verify
W1. Limited Novelty. I would say that this is a great engineering work with a reasonable pipeline that should produce good results, but it lacks a significant distinction from previous work. For example, CineMaster proposes using 3D depth box and class labels to achieve semantic layout control with ControlNet, which is quite similar to this work, in my opinion. Also, the authors claim that they propose instance-masked attention and instance-masked loss. However, neither is ground-br
1. The proposed IMA presents a simple yet effective solution by integrating instance-level identity conditioning and cross-frame propagation into the Transformer's 3D self-attention mechanism via instance identity masks and instance trajectory masks.
2. The evaluation is comprehensive. Beyond standard video generation metrics (FID, FVD), the paper thoroughly assesses the utility of the generated data through downstream tasks, including perception and multi-object tracking.
3. The video results i
1. The necessity and advantages of injecting instance attributes (category, size, tracking ID) as a global condition G into the attention mechanism via the Instance Identity Mask, compared to traditional conditioning approaches in diffusion models, require deeper discussion and justification.
2. It needs a clearer rationale for why encoding these instance attributes into a global embedding G and interacting via the Identity Mask M_{k,m+i} is superior to alternative conditioning strategies, such as inj
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Image Enhancement Techniques
