PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
Gabriele Serussi, David Vainshtein, Jonathan Kouchly, Dotan Di Castro, Chaim Baskin

TL;DR
PREGEN is an efficient, zero-shot framework for composed video retrieval that leverages frozen vision-language models and lightweight encoding to outperform prior methods significantly.
Contribution
PREGEN avoids fine-tuning the VLM: it extracts the final-token hidden state from each layer of a frozen VLM and trains a lightweight encoder on these states to produce a compact retrieval embedding, reaching state-of-the-art results on CoVR tasks.
Findings
Surpasses prior methods by +27.23 and +69.59 points in Recall@1 (the latter on FineCVR, from 26.79 to 96.38)
Demonstrates robustness across different VLM backbones
Exhibits strong zero-shot generalization to complex modifications
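For readers unfamiliar with the metric, Recall@1 is the percentage of queries whose top-ranked gallery video is the correct target. The following is a minimal sketch of Recall@K, assuming cosine-similarity ranking over precomputed embeddings. The function names and toy vectors are illustrative assumptions and do not come from the paper or its evaluation code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_k(query_embs, gallery_embs, target_idx, k=1):
    """Percentage of queries whose target is among the k most similar gallery items."""
    hits = 0
    for q, target in zip(query_embs, target_idx):
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(range(len(gallery_embs)),
                        key=lambda i: cosine(q, gallery_embs[i]),
                        reverse=True)
        hits += target in ranked[:k]
    return 100.0 * hits / len(query_embs)

# Toy data (hypothetical): 3 queries, 3 gallery videos.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
targets = [0, 1, 2]  # the third query ranks its target second, so it misses at k=1

r1 = recall_at_k(queries, gallery, targets, k=1)
r2 = recall_at_k(queries, gallery, targets, k=2)
```

In this toy setup two of the three queries rank their target first, so Recall@1 is lower than Recall@2, which counts the third query as a hit as well.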
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on…
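The extraction step described in the abstract can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation. The frozen VLM is replaced by random numbers, the tensor sizes are made up, and the "simple encoder" is stood in for by mean pooling across layers followed by a single linear projection, because the encoder's actual architecture is not specified in this excerpt.

```python
import math
import random

def last_token_per_layer(hidden_states):
    """hidden_states: list over layers of [seq_len][hidden_dim] lists.
    Returns one vector per layer: the final token's hidden state."""
    return [layer[-1] for layer in hidden_states]

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def project(vec, weight):
    """Linear map; weight has shape [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# Stand-in for the frozen VLM's per-layer outputs (random values, hypothetical sizes).
random.seed(0)
num_layers, seq_len, hidden_dim, embed_dim = 4, 6, 8, 3
hidden_states = [[[random.gauss(0, 1) for _ in range(hidden_dim)]
                  for _ in range(seq_len)] for _ in range(num_layers)]

# Stand-in for the trained lightweight encoder: an untrained random linear layer.
weight = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(embed_dim)]

per_layer = last_token_per_layer(hidden_states)    # [num_layers][hidden_dim]
pooled = mean_pool(per_layer)                       # [hidden_dim]
embedding = l2_normalize(project(pooled, weight))   # [embed_dim], unit length
```

In this reading, only the projection would be trained while the VLM stays frozen, and the query embedding would be compared against gallery embeddings by similarity at retrieval time.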
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper introduces an efficient and well-designed framework for composed video retrieval.
1. **Writing issues** – The long sentence in line 95 (“This approach results in embeddings that fail…”) is confusing and should be rewritten for clarity.
2. **Figures** – Figures 1 and 2 lack proper legends. For instance, the meanings of the blue, yellow, and pink circles are unclear.
3. **Architecture confusion** – In line 231, Figure 2 shows that hidden states are encoded by a *transformer encoder*, but the text mentions another *VLM* used for processing. This inconsistency should be clarified
1. The paper is well written and clearly presented, making it easy to follow the motivation and design choices.
2. The **retrieval performance is impressive**, significantly outperforming prior methods on several CoVR benchmarks. For example, it improves the previous result on FineCVR from 26.79 to 96.38.
3. Using frozen VLMs with a lightweight encoder is an efficient design philosophy, aligning with current trends in leveraging pre-trained multimodal models. The work provides some evidence of robustness acros
The Recall@1 = 98% result **seems unusually high**. There is no analysis on potential overfitting or data leakage, which undermines the credibility of the claim. __Can you check again if the evaluation code is correct?__ For example, test a subset from MSRVTT/MSVD. The methodological novelty is limited: the core idea mainly extends existing MLLM-based retrieval pipelines by aggregating multi-layer features rather than relying solely on the last layer. The paper lacks an efficiency analysis — i
1. The motivation behind the design is well explained.
2. The method demonstrates strong performance.
3. The solution is simple and sensible.
1. Multi-layer pooling to obtain richer features is intuitive and therefore offers limited novelty.
2. The claim of no training or fine-tuning is overstated. Line 85 is misleading, since the aggregated feature is then projected by a trained component.
3. Generalization to the OOD setup is unclear. Section 4.4 runs experiments on the WebVid dataset, which does not seem to be an OOD scenario.
4. The writing is not smooth. For example, the "latent thoughts" in the title are never defined or discussed.
- The problem formulation is clear. The limitations of existing studies are well discussed, and PREGEN is proposed to address them.
- The experiments are well designed to support the claims. The authors show how using every layer boosts performance. Hard negative mining also brings additional improvements, as shown explicitly in Table 3.
- PREGEN already achieves nearly 100% in R@1.
- The main idea of using every VLM layer is somewhat simple and could easily be tried as a first attempt. Specifically, PREGEN with averaging over encoder outputs already achieves significant improvements. While this is a nice finding, it also shows that the proposed scheme does not require much technical effort (less technical challenge). It is also surprising that 1-layer PREGEN performs poorly compared to previous approaches with 1 layer, which raises many questions.
- While
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
