VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Boyu Chen; Zikang Wang; Zhengrong Yue; Kainan Yan; Chenyun Yu; Yi Huang; Zijun Liu; Yafei Wen; Xiaoxin Chen; Yang Liu; Peng Li; Yali Wang

arXiv:2511.19524 · cs.CV · March 5, 2026

TL;DR

VideoChat-M1 introduces a multi-agent reinforcement learning framework with collaborative policy planning for improved video understanding, enabling dynamic tool invocation and inter-agent communication to achieve state-of-the-art results.

Contribution

The paper proposes a multi-agent system that combines collaborative policy planning with reinforcement learning for adaptive, context-aware video understanding.

Findings

01. Achieves state-of-the-art performance on eight benchmarks.

02. Significantly outperforms existing models such as Gemini 2.5 Pro and GPT-4o.

03. Demonstrates effective multi-agent collaboration on complex video tasks.

Abstract

Multi-agent frameworks built on tool-augmented Multimodal Large Language Models (MLLMs) are driving progress in video understanding. However, most such frameworks adopt static, non-learnable tool invocation mechanisms, which limits the discovery of the diverse clues needed for robust perception and reasoning over temporally or spatially complex videos. To address this challenge, we propose a novel multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy…
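The first two processes named in the abstract, Policy Generation and Policy Execution, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tool names (`caption_clip`, `locate_moment`, `detect_objects`), the `PolicyAgent` class, and the keyword rule that stands in for a learned policy are hypothetical and are not taken from the paper. The sketch covers only the two steps fully described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A tool takes the user's query and returns a textual clue.
# Hypothetical interface: the paper's actual tool set is not given here.
Tool = Callable[[str], str]


def make_tools() -> Dict[str, Tool]:
    """Build a toy tool registry (illustrative stand-ins, not real video tools)."""
    return {
        "caption_clip": lambda query: f"caption for '{query}'",
        "locate_moment": lambda query: f"time span for '{query}'",
        "detect_objects": lambda query: f"objects for '{query}'",
    }


@dataclass
class PolicyAgent:
    name: str
    preferred_tools: List[str]  # stand-in for a learned, agent-specific policy
    evidence: List[str] = field(default_factory=list)

    def generate_policy(self, query: str) -> List[str]:
        """Policy Generation: return an ordered tool list tailored to the query.

        Assumption: a simple keyword rule replaces the MLLM-generated policy.
        Temporal questions ("when ...") move the localization tool to the front.
        """
        policy = list(self.preferred_tools)
        if "when" in query.lower() and "locate_moment" in policy:
            policy.remove("locate_moment")
            policy.insert(0, "locate_moment")
        return policy

    def execute_policy(
        self, policy: List[str], query: str, tools: Dict[str, Tool]
    ) -> List[str]:
        """Policy Execution: invoke the tools one after another, collecting clues."""
        for tool_name in policy:
            self.evidence.append(f"{tool_name}: {tools[tool_name](query)}")
        return self.evidence


def run_agents(
    query: str, agents: List[PolicyAgent], tools: Dict[str, Tool]
) -> Dict[str, List[str]]:
    """Let every agent generate and then execute its own policy."""
    results: Dict[str, List[str]] = {}
    for agent in agents:
        policy = agent.generate_policy(query)
        results[agent.name] = agent.execute_policy(policy, query, tools)
    return results
```

Calling `run_agents("When does the dog jump?", agents, make_tools())` with two agents that prefer different tools yields a separate evidence list per agent, which is the point of using several policies rather than one fixed tool pipeline.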

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling