X-Aligner: Composed Visual Retrieval without the Bells and Whistles

Yuqian Zheng; Mariana-Iuliana Georgescu

arXiv:2601.16582·cs.CV·January 26, 2026

Open Access

TL;DR

X-Aligner is a cross-attention-based framework that builds on Vision Language Models (VLMs) for composed video retrieval, achieving state-of-the-art results and strong zero-shot generalization on composed image retrieval (CIR) datasets.

Contribution

The paper proposes X-Aligner, a multimodal fusion framework that combines a progressive cross-attention module with two-stage training to improve composed video retrieval performance.

Findings

01. Achieves 63.93% Recall@1 on Webvid-CoVR-Test

02. Outperforms existing CoVR methods in retrieval accuracy

03. Demonstrates strong zero-shot generalization on CIR datasets
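
For readers unfamiliar with the Recall@1 metric cited in the findings, the following is a minimal sketch of how Recall@K is commonly computed for retrieval: the percentage of queries whose ground-truth target appears among the K highest-scoring candidates. The function name `recall_at_k`, the toy similarity scores, and the tie-breaking rule are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    target_index: Sequence[int],
    k: int = 1,
) -> float:
    """Percentage of queries whose target is among the k best-scoring candidates.

    similarity[i][j] is the score between query i and candidate j
    (hypothetical values here; a real system would use embedding similarity).
    """
    if not similarity:
        raise ValueError("need at least one query")
    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank candidate indices by descending score; ties keep index order.
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 queries scored against 4 candidate videos.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # target 1 is ranked second
    [0.1, 0.2, 0.3, 0.7],  # target 3 is ranked first
]
targets = [0, 1, 3]

r1 = recall_at_k(sim, targets, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(sim, targets, k=2)  # all 3 queries hit
```

Under this definition, a Recall@1 of 63.93% means the correct target video is the top-ranked result for roughly 64 out of every 100 queries.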

Abstract

Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second…
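
The idea of progressively fusing visual and textual inputs through stacked cross-attention layers can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes visual tokens act as queries and text tokens (modification text, optionally concatenated with caption tokens) act as keys and values, and it omits learned projections, multiple heads, and normalization. The names `cross_attention` and `fuse` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption for brevity: no learned query/key/value projections.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query token to every context token.
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(d) for c in context]
        weights = softmax(scores)
        # Weighted average of the context tokens.
        attended = [
            sum(weights[i] * context[i][j] for i in range(len(context)))
            for j in range(d)
        ]
        # Residual connection keeps the original query information.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def fuse(visual: Matrix, text: Matrix, num_layers: int = 2) -> List[float]:
    """Apply cross-attention repeatedly, then mean-pool into one query vector."""
    fused = visual
    for _ in range(num_layers):
        fused = cross_attention(fused, text)
    n = len(fused)
    return [sum(tok[j] for tok in fused) / n for j in range(len(fused[0]))]


# Toy inputs: two visual tokens and one text token, all of dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 2.0]]
query_embedding = fuse(visual_tokens, text_tokens, num_layers=2)
```

In a retrieval setting, the pooled `query_embedding` would then be compared against target video embeddings, for example by cosine similarity, to rank candidates.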

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques