EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yaolun Zhang; Ruohui Wang; Jiahao Wang; Yepeng Tang; Xuanyu Zheng; Haonan Duan; Hao Lu; Hanming Deng; Lewei Lu

arXiv:2603.22918·cs.CV·March 30, 2026

TL;DR

EVA introduces an efficient reinforcement learning framework for end-to-end video understanding, enabling autonomous, query-driven analysis of videos with improved performance over existing methods.

Contribution

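EVA contributes a planning-before-perception approach, in which the agent decides what, when, and how to watch before processing frames, together with a three-stage training pipeline. The two are reported to improve both the efficiency and the accuracy of video understanding.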

Findings

01 EVA outperforms baselines by 6-12% on six benchmarks.

02 EVA achieves 1-3% higher accuracy than prior adaptive agent methods.

03 The three-stage training pipeline stabilizes and improves agent training.

Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective…
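The iterative summary-plan-action-reflection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `WatchAction`, `AgentState`, `run_agent`, and the three toy callbacks are hypothetical names chosen for illustration, and the policy here is a hand-written stub rather than a trained MLLM. The only point it shows is the ordering, where the agent chooses a time span and sampling density before any frames are perceived.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class WatchAction:
    """Hypothetical action: which span to view and how densely to sample it."""
    start_s: float    # when to watch (span start, seconds)
    end_s: float      # when to watch (span end, seconds)
    num_frames: int   # how to watch (number of frames sampled in the span)


@dataclass
class AgentState:
    query: str
    summary: str = ""   # running summary of the evidence gathered so far
    frames_seen: int = 0
    history: List[WatchAction] = field(default_factory=list)


def run_agent(
    query: str,
    duration_s: float,
    plan: Callable[[AgentState, float], WatchAction],
    perceive: Callable[[WatchAction], str],
    reflect: Callable[[AgentState], Optional[str]],
    max_steps: int = 4,
) -> Tuple[Optional[str], AgentState]:
    """Run a plan -> act -> summarize -> reflect loop until an answer is produced."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        action = plan(state, duration_s)    # plan first: no frames consumed yet
        observation = perceive(action)      # act: look only at the chosen span
        state.frames_seen += action.num_frames
        state.history.append(action)
        state.summary = (state.summary + " " + observation).strip()  # summarize
        answer = reflect(state)             # reflect: answer now or keep looking?
        if answer is not None:
            return answer, state
    return None, state


# Toy stand-ins for the learned components (assumptions, for illustration only).
def toy_plan(state: AgentState, duration_s: float) -> WatchAction:
    # Sweep the video one quarter at a time, sampling 4 frames per quarter.
    chunk = duration_s / 4
    i = len(state.history)
    return WatchAction(i * chunk, (i + 1) * chunk, num_frames=4)


def toy_perceive(action: WatchAction) -> str:
    event_t = 70.0  # pretend the queried event happens at t = 70 s
    return "event" if action.start_s <= event_t < action.end_s else "nothing"


def toy_reflect(state: AgentState) -> Optional[str]:
    return "found" if "event" in state.summary else None


answer, state = run_agent(
    "When does the event occur?", 120.0, toy_plan, toy_perceive, toy_reflect
)
```

In this toy run the loop stops as soon as the reflected summary contains the event, so the remaining part of the video is never sampled. A learned policy would replace the fixed sweep in `toy_plan` with query-dependent choices of span and frame count.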

Code & Models

Repositories

wangruohui/EfficientVideoAgent (GitHub)

Models

WRHC/EfficientVideoAgent (Hugging Face model, 14 downloads, 3 likes)
