A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Yichang Xu; Gaowen Liu; Ramana Rao Kompella; Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Furkan Tekin; Zachary Yahn; Ling Liu

arXiv:2603.14052·cs.CV·March 23, 2026

Open Access

TL;DR

This paper introduces A4VL, a multi-agent perception-action framework that improves the efficiency and accuracy of long-video reasoning through iterative exploration, multi-agent collaboration, and event-driven partitioning, and that outperforms existing methods on VideoQA benchmarks.

Contribution

The paper proposes a novel multi-agent perception-action alliance, A4VL, for scalable and efficient long-video reasoning with iterative exploration and collaborative decision-making.

Findings

01 A4VL outperforms 18 existing VLMs on VideoQA benchmarks.

02 A4VL achieves significantly lower inference latency.

03 A4VL effectively scales to real-world long videos.

Abstract

This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question answering (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made…
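
The round structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`run_alliance`, `peer_scores`, `weighted_consensus`, `min_share`, the toy agents) is hypothetical, the VLM agents are replaced by plain callables, and relevance ranking is not modeled. The abstract is truncated before it states what decision follows the consensus check, so the sketch assumes the simplest reading of "multi-round": return the answer when consensus is reached, otherwise explore again, and return no answer if the round budget runs out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# All names below are hypothetical; none come from the paper.
Proposal = Tuple[str, str]                        # (answer, rationale)
Agent = Callable[[str, Sequence[int]], Proposal]  # (question, block ids) -> proposal
Perceive = Callable[[str, int], List[int]]        # (question, round) -> relevant block ids
Review = Callable[[str, str, Proposal], float]    # (reviewer, author, proposal) -> score


def peer_scores(proposals: Dict[str, Proposal], review: Review) -> Dict[str, float]:
    """Mean score each agent's proposal receives from the other agents."""
    scores = {}
    for author, proposal in proposals.items():
        given = [review(r, author, proposal) for r in proposals if r != author]
        scores[author] = sum(given) / len(given) if given else 0.0
    return scores


def weighted_consensus(
    proposals: Dict[str, Proposal], scores: Dict[str, float], min_share: float
) -> Optional[str]:
    """Return the answer holding at least `min_share` of total peer score, else None.

    Assumption: the paper's consensus criterion is not given in the abstract;
    a score-weighted vote stands in for it here.
    """
    weight: Dict[str, float] = defaultdict(float)
    for author, (answer, _rationale) in proposals.items():
        weight[answer] += scores[author]
    total = sum(weight.values())
    if total == 0:
        return None
    answer, top = max(weight.items(), key=lambda kv: kv[1])
    return answer if top / total >= min_share else None


def run_alliance(
    question: str,
    agents: Dict[str, Agent],
    perceive: Perceive,
    review: Review,
    max_rounds: int = 3,
    min_share: float = 0.6,
) -> Tuple[Optional[str], int]:
    """Run perception then action exploration per round; return (answer, rounds used)."""
    for round_idx in range(max_rounds):
        blocks = perceive(question, round_idx)           # perception exploration
        proposals = {name: agent(question, blocks)       # step 1: answer + rationale
                     for name, agent in agents.items()}
        scores = peer_scores(proposals, review)          # step 2: cross-review
        answer = weighted_consensus(proposals, scores, min_share)  # step 3
        if answer is not None:
            return answer, round_idx + 1
    return None, max_rounds  # assumption: no answer is forced without consensus


# Toy stand-ins for VLM agents: agent "b" changes its answer once it sees block 1.
def toy_perceive(question: str, round_idx: int) -> List[int]:
    return [round_idx]

toy_agents: Dict[str, Agent] = {
    "a": lambda q, blocks: ("B", "stub rationale"),
    "b": lambda q, blocks: ("A" if list(blocks) == [0] else "B", "stub rationale"),
    "c": lambda q, blocks: ("B", "stub rationale"),
}

def toy_review(reviewer: str, author: str, proposal: Proposal) -> float:
    return 1.0  # every reviewer rates every proposal equally
```

With the toy agents, a strict threshold such as `min_share=0.9` is not met in the first round (two of three agents agree) and is met in the second, while a looser `min_share=0.6` accepts the first-round majority. A real system would need at least two agents, since a lone agent receives no peer scores in this sketch.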

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis