Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu

TL;DR
Symphony is a multi-agent system inspired by human cognition that decomposes long-video understanding into fine-grained subtasks, combines reflection-enhanced deep reasoning with VLM-based grounding, and achieves state-of-the-art results on long-video understanding benchmarks.
Contribution
The paper introduces Symphony, a multi-agent framework that improves long-video understanding through cognitively inspired task decomposition, reasoning, and grounding mechanisms.
Findings
Achieves a 5.0% improvement on LVBench over previous methods.
Effectively decomposes complex long-video tasks into manageable subtasks.
Enhances reasoning and relevance assessment in long-video understanding.
Abstract
Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Reinforcement Learning in Robotics
