Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
Kerui Chen, Jinglu Wang, Jianrong Zhang, Ming Li, Yan Lu, Hehe Fan

TL;DR
The paper introduces MACF, a scalable multi-agent framework for video understanding that preserves visual fidelity and outperforms existing models under budget constraints.
Contribution
MACF decouples perception budgets from video complexity, enabling efficient, high-fidelity multi-agent collaboration through latent communication and curriculum training.
Findings
MACF outperforms state-of-the-art models on multiple benchmarks.
The framework effectively balances perception budget and video complexity.
Latent communication enables scalable and accurate video understanding.
Abstract
Multi-modal large language models (MLLMs) advance vision-language understanding but face inherent limitations in long-video tasks because their perception context budgets are bounded. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration by a central coordinator. We introduce a…
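The partition, encode, and coordinate flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`partition`, `encode_segment`, `coordinate`), the parameters `budget` and `tokens_per_agent`, and the use of chunked mean pooling as a stand-in for a learned agent encoder are all assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def partition(frames: List[Vector], budget: int) -> List[List[Vector]]:
    """Split frame features into consecutive segments of at most `budget` frames."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [frames[i:i + budget] for i in range(0, len(frames), budget)]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def encode_segment(segment: List[Vector], tokens_per_agent: int) -> List[Vector]:
    """Hypothetical agent: compress one segment into a few latent tokens.

    Assumption: chunked mean pooling stands in for the learned encoder.
    """
    k = min(tokens_per_agent, len(segment))
    # Integer bounds that split the segment into k non-empty chunks.
    bounds = [i * len(segment) // k for i in range(k + 1)]
    return [mean(segment[bounds[i]:bounds[i + 1]]) for i in range(k)]


def coordinate(frames: List[Vector], budget: int, tokens_per_agent: int) -> List[Vector]:
    """Hypothetical coordinator input: concatenate every agent's latent tokens."""
    segments = partition(frames, budget)
    messages = [encode_segment(seg, tokens_per_agent) for seg in segments]
    return [token for message in messages for token in message]


# Ten one-dimensional "frames", agents limited to 4 frames and 2 tokens each.
frames = [[float(i)] for i in range(10)]
latent = coordinate(frames, budget=4, tokens_per_agent=2)
print(len(latent), latent)
```

In this sketch no agent ever sees more than `budget` frames, and the coordinator receives at most `tokens_per_agent` tokens per segment, so its input grows with the number of agents rather than with the raw frame count. That is the sense in which the per-agent budget is decoupled from video length here; the actual protocol and training procedure are not shown in the text above.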
