HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
Yuehan Zhu, Jingqi Zhao, Jiawen Zhao, Xudong Mao, Baoquan Zhao

TL;DR
HiCrew introduces a hierarchical multi-agent framework for long-form video understanding, effectively capturing temporal and causal dependencies through adaptive collaboration and structured representations.
Contribution
The paper presents a novel hierarchical multi-agent approach with a hybrid tree structure, question-aware captioning, and dynamic planning for improved reasoning in long videos.
Findings
Significant improvements on EgoSchema and NExT-QA datasets.
Enhanced performance in temporal and causal reasoning tasks.
Effective preservation of temporal topology and semantic coherence.
Abstract
Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware…
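To make the Hybrid Tree idea from the abstract more concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The abstract only says that shot boundaries preserve temporal topology and that clustering inside each segment is guided by relevance. Everything else here is a hypothetical stand-in: the cosine-distance threshold used as a shot detector, the per-frame `relevance` scores in [0, 1], the rule that more relevant shots are split more deeply, and all function names.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two feature vectors (1.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def split_into_shots(features, threshold=0.5):
    """Assumed shot detector: cut wherever consecutive frames differ strongly.

    Returns (start, end) index pairs, end exclusive, in temporal order.
    """
    if not features:
        return []
    shots, start = [], 0
    for i in range(1, len(features)):
        if cosine_distance(features[i - 1], features[i]) > threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(features)))
    return shots


def build_subtree(features, start, end, depth):
    """Recursively bisect a shot at its largest adjacent-frame gap.

    Children always cover contiguous, ordered spans, so temporal order
    inside the shot is never shuffled.
    """
    node = {"span": (start, end), "children": []}
    if depth == 0 or end - start < 2:
        return node
    cut = max(
        range(start + 1, end),
        key=lambda i: cosine_distance(features[i - 1], features[i]),
    )
    node["children"] = [
        build_subtree(features, start, cut, depth - 1),
        build_subtree(features, cut, end, depth - 1),
    ]
    return node


def build_hybrid_tree(features, relevance, threshold=0.5, max_depth=2):
    """Top level follows shot boundaries; depth per shot follows mean relevance.

    The depth rule (round(mean relevance * max_depth)) is an illustrative
    assumption, not taken from the paper.
    """
    root = {"span": (0, len(features)), "children": []}
    for start, end in split_into_shots(features, threshold):
        mean_rel = sum(relevance[start:end]) / (end - start)
        depth = round(mean_rel * max_depth)
        root["children"].append(build_subtree(features, start, end, depth))
    return root


# Toy input: two frames of one scene, then three frames of another.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [0.0, 1.0]]
relevance = [0.0, 0.0, 1.0, 1.0, 1.0]  # hypothetical question-relevance scores
tree = build_hybrid_tree(features, relevance)
print(split_into_shots(features))  # [(0, 2), (2, 5)]
```

In this toy run the irrelevant first shot stays a single leaf, while the relevant second shot is subdivided further. That is one simple way to spend representation budget where the question needs it while keeping shots in their original order.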
