DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Chenlong Deng; Mengjie Deng; Junjie Wu; Dun Zeng; Teng Wang; Qingsong Xie; Jiadeng Huang; Shengjie Ma; Changwang Zhang; Zhaoxiang Wang; Jun Wang; Yutao Zhu; Zhicheng Dou

arXiv:2602.10809·cs.CV·February 12, 2026

TL;DR

DeepImageSearch introduces an agent-based approach for context-aware image retrieval in visual histories, emphasizing multi-step reasoning over temporal visual data, and provides a challenging benchmark with a scalable human-model collaborative pipeline.

Contribution

The paper presents a novel agentic paradigm for image retrieval in visual histories, along with the DISBench benchmark and a scalable human-model collaborative pipeline that uses vision-language models to build context-dependent queries.

Findings

01 DISBench challenges current models significantly.

02 Agent-based reasoning improves retrieval in visual histories.

03 The pipeline effectively mines spatiotemporal associations.

Abstract

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations,…

Code & Models

Datasets

RUC-NLPIR/DISBench
dataset · 334 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques