ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou; Yangfan He; Yaofeng Su; Siwei Han; Joel Jang; Gedas Bertasius; Mohit Bansal; Huaxiu Yao

arXiv:2506.01300·cs.CV·June 3, 2025

Open Access

TL;DR

ReAgent-V is a flexible, reward-driven multi-agent framework that improves video understanding by enabling real-time feedback, iterative reasoning, and data filtering, leading to significant performance gains across multiple tasks.

Contribution

The paper introduces ReAgent-V, a novel framework integrating real-time reward signals and multi-perspective reflection for enhanced video understanding and reasoning.

Findings

01. Achieves up to 6.9% improvement in video understanding accuracy.

02. Enhances reasoning capabilities across 12 datasets.

03. Supports flexible tool integration for diverse tasks.

Abstract

Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward…
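The abstract describes combining frame selection with reward feedback at inference time. The following is a minimal sketch of that general idea, not the authors' implementation: the names `select_frames`, `answer_with_feedback`, `answer_fn`, `reward_fn`, `threshold`, and `max_rounds` are all hypothetical, and the per-frame relevance scores, the answering model, and the reward model are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One round of the loop: the answer produced and the reward it received."""
    answer: str
    reward: float


def select_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    `scores` is an assumed per-frame relevance score; how it is computed
    is outside the scope of this sketch.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def answer_with_feedback(
    frames: List[int],
    answer_fn: Callable[[List[int], str], str],   # hypothetical answering model
    reward_fn: Callable[[str], float],            # hypothetical reward model
    threshold: float = 0.8,                       # assumed acceptance threshold
    max_rounds: int = 3,                          # assumed retry budget
) -> List[Step]:
    """Answer, score the answer, and retry with feedback until the reward
    reaches the threshold or the round budget is exhausted."""
    history: List[Step] = []
    feedback = ""
    for _ in range(max_rounds):
        answer = answer_fn(frames, feedback)
        reward = reward_fn(answer)
        history.append(Step(answer, reward))
        if reward >= threshold:
            break
        # Pass the low-scoring answer back so the next round can revise it.
        feedback = f"previous answer {answer!r} scored {reward:.2f}; reconsider"
    return history
```

The last entry of the returned history holds the final answer; the earlier entries record each rejected attempt together with its reward, which is one way such a loop could also expose signals for filtering data by reward.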

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare