Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

Kerui Chen; Jinglu Wang; Jianrong Zhang; Ming Li; Yan Lu; Hehe Fan

arXiv:2605.00444 · cs.CV · May 4, 2026

TL;DR

The paper introduces MACF, a scalable multi-agent framework for video understanding that preserves visual fidelity and outperforms existing models under budget constraints.

Contribution

MACF decouples perception budgets from video complexity, enabling efficient, high-fidelity multi-agent collaboration through latent communication and curriculum training.

Findings

01

MACF outperforms state-of-the-art models on multiple benchmarks.

02

The framework decouples each agent's perception budget from the overall complexity of the video.

03

Latent communication enables scalable and accurate video understanding.

Abstract

Multi-modal large language models (MLLMs) advance vision-language understanding but face inherent limitations in long-video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration by a central coordinator. We introduce a…
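The partition-encode-coordinate flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`partition`, `encode_segment`, `coordinate`), the fixed-length segmentation, and the mean pooling that stands in for a learned agent encoder are all assumptions made for illustration, and frames are represented as plain feature vectors.

```python
from typing import List

Frame = List[float]  # hypothetical: one feature vector per video frame


def partition(frames: List[Frame], budget: int) -> List[List[Frame]]:
    """Split frames into consecutive segments of at most `budget` frames."""
    return [frames[i:i + budget] for i in range(0, len(frames), budget)]


def encode_segment(segment: List[Frame], num_tokens: int) -> List[Frame]:
    """Stand-in for one agent: compress a segment into at most `num_tokens`
    latent tokens by mean-pooling contiguous chunks of frames.
    (Assumption: a real agent would use a learned encoder instead.)"""
    n = len(segment)
    k = min(num_tokens, n)
    dim = len(segment[0])
    tokens = []
    for j in range(k):
        chunk = segment[j * n // k:(j + 1) * n // k]
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


def coordinate(frames: List[Frame], budget: int, num_tokens: int) -> List[Frame]:
    """Stand-in for the coordinator: gather every agent's latent tokens
    into one sequence for downstream reasoning."""
    latent: List[Frame] = []
    for segment in partition(frames, budget):
        latent.extend(encode_segment(segment, num_tokens))
    return latent


if __name__ == "__main__":
    video = [[float(i)] for i in range(10)]  # 10 frames, 1-dimensional features
    print(coordinate(video, budget=4, num_tokens=2))
```

In this sketch each agent sees at most `budget` frames, while the coordinator receives at most `num_tokens` tokens per segment, so its input grows with the number of segments rather than with the raw frame count.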
