Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation
Faraz Waseem, Muhammad Shahzad

TL;DR
This survey reviews recent advances in long video generation, highlighting challenges, techniques like GANs and diffusion models, and future research directions to improve scalability and quality.
Contribution
It provides a comprehensive overview of current methods, datasets, metrics, and challenges in long video generation, guiding future research in the field.
Findings
Current systems, including OpenAI's Sora, are limited to videos of up to one minute in length.
Integrating generative AI with divide-and-conquer strategies could improve scalability and offer greater control.
The survey identifies key challenges and future research directions in long video generation.
Abstract
An image may convey a thousand words, but a video composed of hundreds or thousands of image frames tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI's Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions; essential aspects such as planning, story development, and maintaining spatial and temporal consistency present additional hurdles. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video…
Taxonomy
Topics: Digital Games and Media · Cinema and Media Studies
Methods: Diffusion
