UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

Haokun Wen; Xuemeng Song; Haoyu Zhang; Xiangyu Zhao; Weili Guan; Liqiang Nie

arXiv:2604.20318 · cs.CV · April 23, 2026

TL;DR

UniCVR is a unified zero-shot framework that combines Multimodal Large Language Models (MLLMs) with Vision-Language Pre-trained (VLP) models to perform composed visual retrieval over both images and videos, without any task-specific human-annotated data.

Contribution

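UniCVR introduces a two-stage approach, contrastive training of an MLLM query embedder followed by dual-level reranking, that unifies three tasks in a zero-shot setting: composed image retrieval, multi-turn composed image retrieval, and composed video retrieval.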

Findings

01 Achieves state-of-the-art results on five benchmarks.

02 Effectively unifies image and video retrieval tasks.

03 Demonstrates strong generalization without task-specific annotations.

Abstract

Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive…
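The abstract is cut off before it specifies the Stage I objective, so the following is only a generic illustration of contrastive training for a query embedder, not the authors' loss. It is a minimal sketch of a one-directional InfoNCE loss over a batch, assuming that row i of `query_emb` (a composed reference-plus-text query embedding) matches row i of `target_emb`. The function name, the temperature value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(query_emb, target_emb, temperature=0.07):
    """Query-to-target InfoNCE over a batch (illustrative, not the paper's loss).

    Row i of query_emb is assumed to match row i of target_emb; every other
    row in the batch serves as a negative. The temperature is a hypothetical
    default, not a value taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = q @ t.T / temperature

    # Numerically stable log-softmax over the targets for each query.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the matched (diagonal) pairs.
    return float(-np.mean(np.diag(log_probs)))
```

With perfectly aligned pairs the loss approaches zero, and it grows when queries are paired with the wrong targets, which is the signal a contrastive embedder is trained on.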
