Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval
Jiaxin Wu, Xiao-Yong Wei, Qing Li

TL;DR
This paper introduces an adaptive multi-agent framework for zero-shot text-to-video retrieval that dynamically orchestrates specialized agents to improve reasoning and retrieval accuracy, with a reported twofold improvement over CLIP4Clip.
Contribution
The paper proposes a multi-agent reasoning framework with dynamic coordination and communication mechanisms for improved zero-shot text-to-video retrieval.
Findings
Twofold improvement over CLIP4Clip
Significantly outperforms state-of-the-art methods
Effective handling of complex temporal and logical queries
Abstract
The rise of short-form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero-shot text-to-video retrieval systems. While recent advances in large-scale pretraining have improved zero-shot cross-modal alignment, existing methods still struggle with query-dependent temporal reasoning, limiting their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi-agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero-shot contextual temporal reasoning, and (3) a query reformulation agent for refining ambiguous queries…
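The orchestration idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the filler-word list, the routing rules, and the toy captions are all hypothetical, and simple word-overlap heuristics stand in for whatever retrieval model and MLLM-based reasoning the paper actually uses. The sketch covers only the three agents named in the visible part of the abstract and shows one way an orchestrator might pick an agent per iteration based on the query.

```python
from typing import Dict, List, Tuple

# Hypothetical toy corpus: video id -> caption (stands in for real video content).
CORPUS: Dict[str, str] = {
    "v_b": "a man sits on a chair then opens a door",
    "v_a": "a man opens a door then sits on a chair",
    "v_c": "a dog runs on the beach",
}

# Assumed filler words that make a query "ambiguous" in this toy setting.
FILLER = {"please", "find", "the", "video", "where", "a"}


def retrieval_agent(query: str, corpus: Dict[str, str], k: int) -> List[str]:
    """Toy retrieval: rank videos by word overlap between query and caption."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda vid: -len(q & set(corpus[vid].lower().split())))
    return ranked[:k]


def reformulation_agent(query: str) -> str:
    """Toy reformulation: lowercase the query and drop filler words."""
    return " ".join(w for w in query.lower().split() if w not in FILLER)


def _segment_index(event: str, segments: List[str]) -> int:
    """Index of the caption segment that shares the most words with the event."""
    ev = set(event.split())
    overlaps = [len(ev & set(seg.split())) for seg in segments]
    return overlaps.index(max(overlaps))


def reasoning_agent(query: str, candidates: List[str], corpus: Dict[str, str]) -> List[str]:
    """Toy temporal check: for 'A then B', move captions where A precedes B to the front."""
    first, second = query.split(" then ", 1)

    def in_order(vid: str) -> bool:
        segments = corpus[vid].lower().split(" then ")
        return _segment_index(first, segments) < _segment_index(second, segments)

    # Stable sort: in-order candidates first, original order kept otherwise.
    return sorted(candidates, key=lambda vid: not in_order(vid))


def orchestrate(query: str, corpus: Dict[str, str], k: int = 2,
                max_iters: int = 3) -> Tuple[List[str], List[str]]:
    """Pick one agent per iteration based on the current query and state."""
    candidates: List[str] = []
    trace: List[str] = []
    for _ in range(max_iters):
        if any(w in FILLER for w in query.lower().split()):
            query = reformulation_agent(query)
            trace.append("reformulate")
        elif not candidates:
            candidates = retrieval_agent(query, corpus, k)
            trace.append("retrieve")
        else:
            if " then " in query:  # only temporal queries trigger reasoning
                candidates = reasoning_agent(query, candidates, corpus)
                trace.append("reason")
            break
    return candidates, trace


ranked, trace = orchestrate(
    "Please find the video where a man opens a door then sits", CORPUS
)
print(ranked, trace)
```

In this toy run the temporal query passes through all three agents, while a plain query such as "dog runs beach" would only trigger retrieval. That difference is the query-dependent behavior the abstract describes, reduced here to hand-written rules.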
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
