A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

TL;DR
This paper introduces A4VL, a multi-agent perception-action framework that improves the efficiency and accuracy of long-video reasoning through iterative exploration, multi-agent collaboration, and event-driven partitioning, and that outperforms existing methods on VideoQA benchmarks.
Contribution
The paper proposes a novel multi-agent perception-action alliance, A4VL, for scalable and efficient long-video reasoning with iterative exploration and collaborative decision-making.
Findings
A4VL outperforms 18 existing VLMs on VideoQA benchmarks.
A4VL achieves significantly lower inference latency.
A4VL effectively scales to real-world long videos.
Abstract
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question answering (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made…
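The round structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it states what happens after the consensus check, so the sketch assumes the loop simply starts another round when no consensus is reached and falls back to the best-reviewed answer once a round budget runs out. All names (`Proposal`, `cross_review`, `consensus`, `run_rounds`, the toy agents, and the `threshold` and `max_rounds` parameters) are hypothetical, and the VLM calls, clue extraction, and clue-based alignment are replaced by plain Python callables.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Proposal:
    agent: str      # name of the agent that produced the answer
    answer: str     # candidate answer to the question
    rationale: str  # short justification attached to the answer


# Hypothetical agent interface: (question, selected block ids) -> Proposal.
Agent = Callable[[str, List[int]], Proposal]


def cross_review(proposals: List[Proposal]) -> Dict[str, float]:
    """Toy cross-review: an agent earns one point per peer that agrees with it."""
    scores = {p.agent: 0.0 for p in proposals}
    for reviewer in proposals:
        for target in proposals:
            if reviewer.agent != target.agent and reviewer.answer == target.answer:
                scores[target.agent] += 1.0
    return scores


def consensus(proposals: List[Proposal], threshold: float) -> Optional[str]:
    """Return the majority answer if its vote share reaches the threshold."""
    answer, votes = Counter(p.answer for p in proposals).most_common(1)[0]
    return answer if votes / len(proposals) >= threshold else None


def run_rounds(
    question: str,
    agents: List[Agent],
    select_blocks: Callable[[str, int], List[int]],
    max_rounds: int = 3,
    threshold: float = 0.6,
) -> Tuple[str, int]:
    """Run perception then action exploration until consensus or budget end."""
    if max_rounds < 1 or not agents:
        raise ValueError("need at least one round and one agent")
    for round_idx in range(max_rounds):
        # Perception exploration (stand-in): pick the blocks to look at.
        blocks = select_blocks(question, round_idx)
        # Action step 1: every agent answers with a rationale.
        proposals = [agent(question, blocks) for agent in agents]
        # Action step 2: agents score one another.
        scores = cross_review(proposals)
        # Action step 3: stop if the agents agree strongly enough.
        agreed = consensus(proposals, threshold)
        if agreed is not None:
            return agreed, round_idx + 1
    # Assumed fallback: answer of the best-reviewed agent in the last round.
    best = max(proposals, key=lambda p: scores[p.agent])
    return best.answer, max_rounds


def toy_select_blocks(question: str, round_idx: int) -> List[int]:
    # Round 0 inspects only block 0; later rounds also include block 5.
    return [0] if round_idx == 0 else [0, 5]


def make_toy_agent(name: str, blind_guess: str) -> Agent:
    def agent(question: str, blocks: List[int]) -> Proposal:
        # The toy agent only answers "B" once the relevant block 5 is visible.
        answer = "B" if 5 in blocks else blind_guess
        return Proposal(name, answer, f"based on blocks {blocks}")
    return agent


if __name__ == "__main__":
    team = [make_toy_agent("a", "A"), make_toy_agent("b", "C"), make_toy_agent("c", "B")]
    print(run_rounds("What happens after the goal?", team, toy_select_blocks))
```

In the toy run, the three agents disagree in the first round because the relevant block has not been selected yet, and they converge in the second round once it is. The agreement-counting score and the vote-share threshold are placeholders chosen for brevity; the paper's cross-review and relevance-ranking scores are not specified in the abstract.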
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
