Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Sridhar S; Nithin A; Shakeel Rifath; Vasantha Raj

arXiv:2506.10005·cs.CV·June 13, 2025

Open Access

TL;DR

This paper presents a comprehensive method for generating high-quality, 60-second cinematic videos from text prompts by integrating state-of-the-art image, audio, and narrative models within a GPU-accelerated environment.

Contribution

It introduces a novel multimodal pipeline combining Stable Diffusion, GPT-2, and hybrid audio generation, with a five-scene framework and advanced post-processing for professional-quality results.

Findings

01. Achieved high visual fidelity and narrative coherence.

02. Demonstrated an efficient GPU-based synthesis process.

03. Supported resolutions up to 1024x768 at 15-30 FPS.

Abstract

Advances in generative artificial intelligence have transformed multimedia creation, enabling automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic videos that combines Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. The method uses a five-scene framework, augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization, to provide professional-quality results. The system was implemented in a GPU-accelerated Google Colab environment using Python 3.11 and offers a dual-mode Gradio interface (Simple and Advanced) that supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments…
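
The linear frame interpolation mentioned in the abstract can be illustrated with a minimal sketch. The paper's actual implementation is not shown in this summary, so the function name `interpolate_frames`, its parameters, and the toy keyframes below are assumptions chosen for illustration; only NumPy is used.

```python
import numpy as np

def interpolate_frames(frame_a, frame_b, n_between):
    """Return n_between frames linearly blended between two keyframes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    a = frame_a.astype(np.float32)
    b = frame_b.astype(np.float32)
    # Blend weights strictly between 0 and 1 (the keyframes themselves are excluded).
    weights = np.linspace(0.0, 1.0, n_between + 2)[1:-1]
    frames = []
    for w in weights:
        blended = (1.0 - w) * a + w * b
        # Round and clamp back to the 8-bit range used for image frames.
        frames.append(np.clip(blended, 0, 255).round().astype(np.uint8))
    return frames

# Toy keyframes (tiny stand-ins for generated images), height x width x RGB.
key_a = np.zeros((4, 6, 3), dtype=np.uint8)
key_b = np.full((4, 6, 3), 200, dtype=np.uint8)

in_between = interpolate_frames(key_a, key_b, 3)
```

With three in-between frames, the blend weights are 0.25, 0.5, and 0.75, so each inserted frame is a weighted average of the two keyframes. A linear blend of this kind is cheap but produces cross-fades rather than true motion, which is one reason it tends to be paired with post-processing steps such as sharpening.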

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Multimedia Communication and Technology

Methods: Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout