Consistency-Preserving Diverse Video Generation

Xinshuang Liu; Runfa Blark Li; Truong Nguyen

arXiv:2602.15287 · cs.CV · February 18, 2026

TL;DR

This paper introduces a joint-sampling framework for text-to-video generation with flow-matching models. It increases diversity across the videos generated for a prompt while preserving temporal consistency within each video, and it avoids costly backpropagation through a video decoder.

Contribution

It proposes a novel joint-sampling method that improves video diversity and temporal consistency efficiently using lightweight latent-space models.

Findings

01. Achieves diversity comparable to strong baselines.

02. Significantly improves temporal consistency.

03. Enhances color naturalness in generated videos.

Abstract

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity…
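The step of applying a diversity-driven update and then removing only the components that would decrease the temporal-consistency objective can be illustrated with a small sketch. This is not the authors' implementation: it assumes one plausible reading in which the removal is a projection onto the half-space where the consistency objective does not decrease to first order. The function name `remove_conflicting_component`, the `eps` parameter, and the toy latent shape are hypothetical and chosen only for illustration.

```python
import numpy as np


def remove_conflicting_component(diversity_update, consistency_grad, eps=1e-12):
    """Illustrative sketch (an assumption, not the paper's algorithm).

    diversity_update: proposed latent update that increases batch diversity.
    consistency_grad: gradient of a temporal-consistency objective
                      (to be maximized) with respect to the same latent.
    Returns the update with the consistency-decreasing component removed.
    """
    d = np.asarray(diversity_update, dtype=float)
    g = np.asarray(consistency_grad, dtype=float)

    # First-order change in the consistency objective caused by the update.
    dot = float(np.vdot(d, g))  # np.vdot flattens both arrays

    # The update does not hurt consistency to first order: keep it unchanged.
    if dot >= 0.0:
        return d.copy()

    # Otherwise subtract the projection of d onto g, so the remaining
    # update is orthogonal to the consistency gradient.
    return d - (dot / (float(np.vdot(g, g)) + eps)) * g


# Toy usage on a random latent of hypothetical shape (frames, channels, h, w).
rng = np.random.default_rng(0)
update = rng.normal(size=(4, 2, 3, 3))
grad = rng.normal(size=(4, 2, 3, 3))
safe_update = remove_conflicting_component(update, grad)
```

In this sketch both inputs live in latent space, which mirrors the abstract's point that the two objectives are computed with lightweight latent-space models rather than through the video decoder. How the actual method defines the objectives and the removal step is not specified in the text above.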

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications