Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval

Jiaxin Wu; Xiao-Yong Wei; Qing Li

arXiv:2602.19040 · cs.IR · February 24, 2026

Open Access

TL;DR

This paper introduces an adaptive multi-agent framework for zero-shot text-to-video retrieval that dynamically orchestrates specialized agents to improve reasoning and retrieval accuracy, with a reported twofold improvement over CLIP4Clip and gains over state-of-the-art methods.

Contribution

It proposes a novel multi-agent reasoning framework with dynamic coordination and communication mechanisms for improved zero-shot text-to-video retrieval.

Findings

01. Twofold improvement over CLIP4Clip

02. Significant outperformance of state-of-the-art methods

03. Effective handling of complex temporal and logical queries

Abstract

The rise of short-form video platforms and the emergence of multimodal large language models (MLLMs) have amplified the need for scalable, effective, zero-shot text-to-video retrieval systems. While recent advances in large-scale pretraining have improved zero-shot cross-modal alignment, existing methods still struggle with query-dependent temporal reasoning, limiting their effectiveness on complex queries involving temporal, logical, or causal relationships. To address these limitations, we propose an adaptive multi-agent retrieval framework that dynamically orchestrates specialized agents over multiple reasoning iterations based on the demands of each query. The framework includes: (1) a retrieval agent for scalable retrieval over large video corpora, (2) a reasoning agent for zero-shot contextual temporal reasoning, and (3) a query reformulation agent for refining ambiguous queries…
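The orchestration described in the abstract can be pictured with a small toy loop. This is only an illustrative sketch, not the authors' implementation: the corpus, the function names (`retrieval_agent`, `reasoning_agent`, `reformulation_agent`, `orchestrate`), the word-overlap scoring, the caption-based ordering check, and the stopping rule are all assumptions made for this example. The real agents presumably operate on video content with learned models rather than on caption strings.

```python
from typing import Dict, List, Tuple

# Hypothetical toy corpus: video id -> caption (a stand-in for real video content).
CORPUS: Dict[str, str] = {
    "v1": "a man opens a door then walks into the kitchen",
    "v2": "a dog runs on the beach",
    "v3": "a woman walks into the kitchen then opens the fridge",
}


def retrieval_agent(query: str, k: int = 3) -> List[str]:
    """Rank videos by word overlap with the query (placeholder for a scalable search)."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(text.split())), vid) for vid, text in CORPUS.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [vid for score, vid in scored[:k] if score > 0]


def reasoning_agent(query: str, candidates: List[str]) -> List[str]:
    """Keep candidates whose caption contains the query's events in the stated order."""
    events = [e.strip() for e in query.lower().split(" then ")]
    kept = []
    for vid in candidates:
        caption, pos = CORPUS[vid], 0
        for event in events:
            pos = caption.find(event, pos)
            if pos < 0:
                break  # event missing or out of order
            pos += len(event)
        else:
            kept.append(vid)
    return kept


def reformulation_agent(query: str) -> str:
    """Normalize loose temporal connectives to 'then' (assumed toy rewrite rule)."""
    for phrase in (" and afterwards ", " and after that "):
        query = query.replace(phrase, " then ")
    return query


def orchestrate(query: str, max_iters: int = 3) -> Tuple[List[str], List[str]]:
    """Call agents iteratively; reformulate only when reasoning rejects every candidate."""
    trace: List[str] = []
    for _ in range(max_iters):
        candidates = retrieval_agent(query)
        trace.append("retrieve")
        verified = reasoning_agent(query, candidates)
        trace.append("reason")
        if verified:
            return verified, trace
        new_query = reformulation_agent(query)
        trace.append("reformulate")
        if new_query == query:
            break  # nothing left to refine
        query = new_query
    return [], trace


results, trace = orchestrate("man opens a door and afterwards walks into the kitchen")
print(results, trace)
```

In this sketch the loosely phrased query fails verification on the first pass, is rewritten once, and then succeeds, whereas a query that already matches a caption finishes after a single retrieve-and-reason pass. That per-query difference in which agents run is the only sense in which the toy loop is "adaptive".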

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling