ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Mingyang Wu; Ashirbad Mishra; Soumik Dey; Shuo Xing; Naveen Ravipati; Hansi Wu; Binbin Li; Zhengzhong Tu

arXiv:2602.10113 · cs.CV · February 11, 2026

TL;DR

ConsID-Gen introduces a view-assisted image-to-video generation framework that enhances multi-view consistency and identity preservation by leveraging auxiliary views and a dual-stream encoder, outperforming existing models.

Contribution

The paper presents a new dataset, ConsIDVid; a benchmarking framework, ConsIDVid-Bench; and a view-assisted generation method, ConsID-Gen, that improves identity fidelity and temporal coherence in image-to-video tasks.

Findings

01 ConsID-Gen outperforms state-of-the-art models on ConsIDVid-Bench.

02 The framework achieves superior identity preservation and temporal coherence.

03 Experiments validate the effectiveness of auxiliary views and dual-stream encoding.

Abstract

Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V…

Code & Models

Datasets

mingyang-wu/ConsIDVid
dataset · 45k dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications