TL;DR
InfLVG is an inference-time framework that enables long, coherent video generation by dynamically selecting relevant context tokens, improving consistency and semantic fidelity without requiring additional long-form video data.
Contribution
We propose InfLVG, an inference-time framework built around a learnable context selection policy optimized with GRPO, which extends autoregressive text-to-video models to long videos while maintaining quality and consistency.
Findings
Extends video length by up to 9 times.
Maintains strong cross-scene consistency.
Achieves high semantic fidelity across scenes.
Abstract
Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks…
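The two ideas named in the abstract, keeping only the most relevant context and scoring sampled choices relative to their group, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it describes how the policy ranks context, so cosine similarity to a prompt embedding is used here purely as a stand-in for the learned policy. The function names (`select_context`, `group_relative_advantages`), the toy vectors, and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_context(history, prompt_embedding, k):
    """Return indices of the k history entries most similar to the prompt.

    Assumption: similarity to the prompt stands in for the learned
    selection policy. Indices are returned in temporal order.
    """
    scores = [cosine(h, prompt_embedding) for h in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group.

    Assumption: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data (hypothetical): four context embeddings and one prompt embedding.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_context(history, prompt_embedding=[1.0, 0.0], k=2)

# Rewards for three sampled selections in the same group.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In this toy run, `kept` holds the indices of the two entries closest to the prompt, and `advantages` is negative for the below-average sample and positive for the above-average one. In a GRPO-style update, selections with positive advantage would be made more likely and those with negative advantage less likely.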
Taxonomy
Methods: Sparse Evolutionary Training
