ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Tao Yu; Haopeng Jin; Hao Wang; Shenghua Chai; Yujia Yang; Junhao Gong; Jiaming Guo; Minghui Zhang; Xinlong Chen; Zhenghao Zhang; Yuxuan Zhou; Yufei Xiong; Shanbin Zhang; Jiabing Yang; Hongzhu Yi; Xinming Wang; Cheng Zhong; Xiao Ma; Zhang Zhang; Yan Huang; Liang Wang

arXiv:2601.23232 · cs.CV · February 17, 2026

TL;DR

ShotFinder introduces a benchmark and a method for open-domain video shot retrieval via web search. The method uses large models for query expansion and localization, and the evaluation reveals significant challenges for current models.

Contribution

The paper presents a novel benchmark and a three-stage retrieval pipeline for open-domain video shot retrieval, addressing the lack of systematic evaluation and analysis in this area.

Findings

01 A significant gap remains between current models and human performance.

02 Temporal localization is more manageable than color and style constraints.

03 Achieving balanced retrieval across the different constraints remains a challenge.

Abstract

In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate…
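The benchmark design described in the abstract (a keyframe-oriented shot description paired with exactly one of five constraint types, drawn from one of 20 thematic categories) can be pictured as a small record type. The sketch below is illustrative only: the class names, field names, and the `count_by_constraint` helper are assumptions made for this example, not the released dataset schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ConstraintType(Enum):
    """The five single-factor constraint types named in the abstract."""
    TEMPORAL_ORDER = "temporal_order"
    COLOR = "color"
    VISUAL_STYLE = "visual_style"
    AUDIO = "audio"
    RESOLUTION = "resolution"


@dataclass(frozen=True)
class ShotQuery:
    """Hypothetical record for one benchmark sample (field names are assumed)."""
    description: str                 # keyframe-oriented shot description
    category: str                    # one of the thematic categories
    constraint_type: ConstraintType  # single-factor: exactly one constraint
    constraint_value: str            # free-text statement of the constraint


def count_by_constraint(samples: Iterable[ShotQuery]) -> Counter:
    """Count samples per constraint type, e.g. to check benchmark balance."""
    return Counter(s.constraint_type for s in samples)


# Made-up samples for illustration; these are not taken from the dataset.
samples = [
    ShotQuery("A surfer rides a wave at sunset", "sports",
              ConstraintType.COLOR, "warm orange tones"),
    ShotQuery("A chef plates a dessert", "food",
              ConstraintType.TEMPORAL_ORDER, "after the sauce is poured"),
    ShotQuery("A city skyline at night", "travel",
              ConstraintType.COLOR, "blue-dominant palette"),
]
counts = count_by_constraint(samples)
```

Keeping the constraint as a single typed field mirrors the "single-factor" framing: each sample varies one controllable factor, so results can be grouped and compared per constraint type.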

Code & Models

Datasets

Kirito-Lab/ShotFinder
dataset · 1.2k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques