EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen

TL;DR
EvoVid introduces a temporal-centric self-evolving framework for Video-LLMs, enabling autonomous improvement from raw videos by leveraging temporal rewards, thus reducing reliance on human annotations.
Contribution
The paper presents a novel temporal-centric self-evolution approach for Video-LLMs, built on two complementary temporal-centric rewards and supervision drawn from raw, unannotated videos.
Findings
Consistent performance improvements across multiple models and benchmarks.
Achieves results competitive with supervised methods.
Demonstrates scalability and effectiveness of temporal-centric self-evolution.
Abstract
Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that…
