TL;DR
EVA is an efficient reinforcement learning framework for an end-to-end video agent that plans before it perceives, enabling autonomous, query-driven video understanding and outperforming baselines by 6-12% on six benchmarks.
Contribution
EVA combines a planning-before-perception approach, realized through iterative summary-plan-action-reflection reasoning, with a three-stage training pipeline; together these improve the efficiency and accuracy of video understanding.
Findings
EVA outperforms baselines by 6-12% on six benchmarks.
EVA achieves 1-3% higher accuracy than prior adaptive agent methods.
The three-stage training pipeline stabilizes and improves agent training.
Abstract
Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective…
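To make the planning-before-perception idea concrete, here is a minimal toy sketch of a summary-plan-action-reflection loop. It is not EVA's implementation and involves no reinforcement learning or MLLM: the "video" is a list of string labels, and `Plan`, `sample_indices`, `coarse_to_fine_planner`, `find_cat`, and `run_agent` are hypothetical names introduced only for illustration. The point it shows is that the agent commits to a plan (which window, how many frames) before reading any frame, and only looks again if reflection finds the evidence insufficient.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Plan:
    start: int       # first frame index of the window ("when to watch")
    end: int         # one past the last frame index of the window
    num_frames: int  # how many frames to sample in the window ("how to watch")


def sample_indices(plan: Plan) -> List[int]:
    """Evenly spaced frame indices inside [plan.start, plan.end)."""
    span = plan.end - plan.start
    n = max(1, min(plan.num_frames, span))
    return [plan.start + (i * span) // n for i in range(n)]


def coarse_to_fine_planner(summary: Dict[int, str], round_idx: int, total: int) -> Plan:
    """Toy planner (an assumption): double the sampling density each round."""
    return Plan(0, total, 5 * 2 ** round_idx)


def find_cat(summary: Dict[int, str]) -> Optional[int]:
    """Toy reflection step: answer only once a 'cat' frame has been observed."""
    hits = [idx for idx, label in summary.items() if label == "cat"]
    return min(hits) if hits else None


def run_agent(
    frames: List[str],
    planner: Callable[[Dict[int, str], int, int], Plan],
    reflect: Callable[[Dict[int, str]], Optional[int]],
    max_rounds: int = 4,
) -> Tuple[Optional[int], int]:
    """Return (answer, number of distinct frames actually viewed)."""
    summary: Dict[int, str] = {}  # running summary of what has been seen so far
    for round_idx in range(max_rounds):
        plan = planner(summary, round_idx, len(frames))  # plan before perceiving
        for idx in sample_indices(plan):                 # action: view only planned frames
            summary[idx] = frames[idx]
        answer = reflect(summary)                        # reflection: is evidence sufficient?
        if answer is not None:
            return answer, len(summary)
    return None, len(summary)


def demo() -> Tuple[Optional[int], int]:
    frames = ["background"] * 100
    for i in range(63, 67):  # the cat is visible in frames 63..66
        frames[i] = "cat"
    return run_agent(frames, coarse_to_fine_planner, find_cat)


if __name__ == "__main__":
    print(demo())
```

In this toy run the agent stops after inspecting a fraction of the frames rather than the whole video, which is the efficiency argument the abstract makes against perception-first strategies. A learned agent would replace the fixed coarse-to-fine planner with a policy conditioned on the query and the running summary.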
