AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency

Piyushkumar Patel

arXiv:2511.00107·cs.CV·November 4, 2025

Open Access

TL;DR

This paper introduces MOVAI, a hierarchical text-to-video framework that combines compositional scene understanding with temporally aware diffusion models to improve temporal consistency, visual quality, and semantic control.

Contribution

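The paper presents a hierarchical framework built from three components: a Compositional Scene Parser (CSP), a Temporal-Spatial Attention Mechanism (TSAM), and a Progressive Video Refinement (PVR) module. The authors report that it advances the state of the art in text-to-video synthesis.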

Findings

01 Achieves a 15.3% improvement in the LPIPS quality metric.

02 Improves FVD by 12.7%, indicating better temporal coherence.

03 Outperforms existing methods in user preference studies by 18.9%.
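
LPIPS and FVD are both distance-style metrics where lower scores are better, so a percentage "improvement" on them is usually read as a relative reduction against a baseline. The page does not say how the reported percentages were computed, so the following is only a minimal sketch of that common convention. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float,
                         lower_is_better: bool = True) -> float:
    """Return the improvement of `new` over `baseline` as a percentage.

    For lower-is-better metrics (e.g. LPIPS, FVD) a drop in score counts
    as a positive improvement; for higher-is-better metrics a rise does.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * delta / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic.
example_baseline_fvd = 400.0
example_new_fvd = 349.2
print(round(relative_improvement(example_baseline_fvd, example_new_fvd), 1))
```

Under this convention, a negative return value would mean the new method scored worse than the baseline.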

Abstract

Text-to-video generation has emerged as a critical frontier in generative artificial intelligence, yet existing approaches struggle with maintaining temporal consistency, compositional understanding, and fine-grained control over visual narratives. We present MOVAI (Multimodal Original Video AI), a novel hierarchical framework that integrates compositional scene understanding with temporally aware diffusion models for high-fidelity text-to-video synthesis. Our approach introduces three key innovations: (1) a Compositional Scene Parser (CSP) that decomposes textual descriptions into hierarchical scene graphs with temporal annotations, (2) a Temporal-Spatial Attention Mechanism (TSAM) that ensures coherent motion dynamics across frames while preserving spatial details, and (3) a Progressive Video Refinement (PVR) module that iteratively enhances video quality through multi-scale temporal…
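
The abstract describes the CSP as producing hierarchical scene graphs with temporal annotations. One way to picture such a structure is sketched below. This is an illustration only, not the paper's representation: the `SceneNode` class, its fields, the `active_at` helper, and the example prompt are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneNode:
    """Hypothetical scene-graph node: a label plus the time span it covers."""
    label: str
    time_span: Tuple[float, float]  # (start, end) in seconds, inclusive
    children: List["SceneNode"] = field(default_factory=list)

    def active_at(self, t: float) -> List[str]:
        """Labels of this node and its descendants whose span covers time t."""
        start, end = self.time_span
        if not (start <= t <= end):
            # A node outside its span hides its whole subtree.
            return []
        labels = [self.label]
        for child in self.children:
            labels.extend(child.active_at(t))
        return labels


# Hypothetical parse of "a dog runs, then jumps, on a beach at sunset".
scene = SceneNode("beach at sunset", (0.0, 4.0), [
    SceneNode("dog", (0.0, 4.0), [
        SceneNode("dog runs", (0.0, 2.5)),
        SceneNode("dog jumps", (2.5, 4.0)),
    ]),
    SceneNode("waves", (0.0, 4.0)),
])

print(scene.active_at(1.0))
print(scene.active_at(3.0))
```

Querying the graph at a timestamp returns the scene elements a frame at that time would need to depict, which is the kind of per-frame conditioning a temporally annotated graph makes possible.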

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation