Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

Yiqing Shen; Chenxiao Fan; Chenjia Li; Mathias Unberath

arXiv:2511.12371·cs.CV·November 18, 2025

TL;DR

This paper introduces a reasoning-based text-to-video retrieval method that represents videos as digital twins (structured scene representations) and uses large language models to reason over them. The method handles implicit queries and provides object-level grounding, reaching 81.2% R@1 on ReasonT2VBench-135.

Contribution

The paper proposes a digital twin video representation combined with a two-stage reasoning framework for retrieving videos from implicit queries, outperforming existing methods and establishing new benchmarks.

Findings

1. Achieves 81.2% R@1 on ReasonT2VBench-135, surpassing baselines by over 50%.

2. Maintains high performance with 81.7% R@1 on extended datasets.

3. Sets new state-of-the-art results on MSR-VTT, MSVD, and VATEX benchmarks.
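
The R@1 figures above are recall-at-k scores with k = 1: the fraction of queries for which a relevant video is ranked first. As a minimal sketch of how such a score is typically computed, assuming exactly one ground-truth video per query (the function name `recall_at_k` and the toy data are illustrative and not taken from the paper):

```python
def recall_at_k(rankings, targets, k=1):
    """Fraction of queries whose ground-truth video is in the top-k results.

    rankings: one list of video IDs per query, best match first.
    targets:  the single ground-truth video ID per query (an assumption;
              some benchmarks allow several relevant videos per query).
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranking, target in zip(rankings, targets) if target in ranking[:k]
    )
    return hits / len(targets)


# Toy example: four queries over a three-video database.
rankings = [
    ["v3", "v1", "v2"],  # ground truth v3 is ranked first
    ["v2", "v3", "v1"],  # ground truth v1 is ranked third
    ["v1", "v2", "v3"],  # ground truth v1 is ranked first
    ["v2", "v1", "v3"],  # ground truth v2 is ranked first
]
targets = ["v3", "v1", "v1", "v2"]

r_at_1 = recall_at_k(rankings, targets, k=1)  # 3 of 4 queries hit
r_at_3 = recall_at_k(rankings, targets, k=3)  # all 4 queries hit
```

Under this definition, 81.2% R@1 would mean that the correct video is the top-ranked result for roughly four out of five queries.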

Abstract

The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over…
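
To make the idea of a structured scene representation more concrete, here is a purely hypothetical sketch of what a digital twin record for one video might look like once serialized as text for a language model. The class names, fields, and JSON layout are all assumptions made for illustration; the abstract does not specify the paper's actual schema or which specialist vision models produce it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ObjectTrack:
    """One salient object in a video (hypothetical fields)."""
    object_id: int
    category: str
    attributes: list[str] = field(default_factory=list)
    frames: list[int] = field(default_factory=list)  # frame indices where visible


@dataclass
class DigitalTwin:
    """Structured description of one video (hypothetical schema)."""
    video_id: str
    objects: list[ObjectTrack] = field(default_factory=list)

    def to_prompt_text(self) -> str:
        # Serialize to JSON so a text-only language model can read it.
        return json.dumps(asdict(self), indent=2)


# Toy record with made-up contents.
twin = DigitalTwin(
    video_id="vid_001",
    objects=[
        ObjectTrack(0, "dog", ["brown", "running"], [0, 1, 2]),
        ObjectTrack(1, "frisbee", ["red"], [1, 2]),
    ],
)
prompt_text = twin.to_prompt_text()
```

A text form like this is one way a language model could be asked which objects satisfy a query condition, with the returned `object_id` values pointing back to the corresponding objects for grounding.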

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization