Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Junhao Cheng, Liang Hou, Xin Tao, Jing Liao

TL;DR
This paper introduces VANS, a model that predicts and generates next video events using joint reinforcement learning of vision-language and video diffusion models, enabling more intuitive and visual procedural learning.
Contribution
It proposes a novel joint reinforcement learning framework, Joint-GRPO, to align vision-language and video diffusion models for next-event video prediction and generation.
Findings
Achieves state-of-the-art results on procedural and predictive benchmarks.
Introduces the VANS-Data-100K dataset for training and evaluation.
Demonstrates effective video event prediction and visualization capabilities.
Abstract
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
