Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Guoyizhe Wei; Yang Jiao; Nan Xi; Zhishen Huang; Jingjing Meng; Rama Chellappa; Yan Gao

arXiv:2602.22510 · cs.CV · February 27, 2026

TL;DR

Pix2Key represents both queries and candidates as open-vocabulary visual dictionaries for composed image retrieval. A self-supervised pretraining component strengthens fine-grained attribute understanding, and diversity-aware reranking keeps result lists varied, which together improve retrieval accuracy.

Contribution

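The paper proposes Pix2Key, a composed image retrieval method that decomposes queries and candidates into open-vocabulary visual dictionaries and refines the dictionary representation with a self-supervised pretraining component, V-Dict-AE, trained on images alone.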

Findings

01 Improves Recall@10 by up to 3.2 points on DFMM-Compose.

02 Adding V-Dict-AE yields an additional 2.3-point gain.

03 Enhances intent consistency and maintains high list diversity.
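
For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of Recall@K as it is commonly computed in retrieval benchmarks: the percentage of queries for which at least one correct target appears among the top K results. The function name, argument layout, and toy data are illustrative assumptions; this page does not specify the paper's exact evaluation protocol.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   list of ranked candidate-id lists, one per query (hypothetical layout)
    relevant_ids: list of sets of correct target ids, one per query
    """
    if not ranked_ids:
        raise ValueError("need at least one query")
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any of its top-k results is a correct target.
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


# Toy example: two queries, the first is answered within the top 2, the second is not.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
targets = [{"b"}, {"z"}]
score = recall_at_k(ranked, targets, k=2)
```

Under this definition, a gain of 3.2 points means that 3.2 percent more of the queries have a correct target in their top 10.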

Abstract

Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key…
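
To make the idea of diversity-aware reranking concrete, here is a minimal sketch using Maximal Marginal Relevance (MMR), a standard technique that trades relevance to the query against redundancy with already-selected results. This is a swapped-in illustration, not Pix2Key's reranker, whose details are not given on this page. The vectors stand in for hypothetical embeddings, and `mmr_rerank`, `lam`, and `top_k` are assumed names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_rerank(query, candidates, top_k=10, lam=0.7):
    """Return candidate indices chosen greedily by MMR.

    lam = 1.0 ranks purely by relevance; lower values penalize candidates
    that are similar to results already selected. (Illustrative only.)
    """
    relevance = [cosine(query, c) for c in candidates]
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < top_k:
        def score(i):
            # Redundancy is the highest similarity to any already-selected item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 differs.
query = [1.0, 0.2]
cands = [[1.0, 0.0], [1.0, 0.01], [0.7, 0.7]]
diverse = mmr_rerank(query, cands, top_k=2, lam=0.5)
relevance_only = mmr_rerank(query, cands, top_k=2, lam=1.0)
```

With `lam=1.0` the two near-duplicate candidates fill the list; lowering `lam` lets the dissimilar candidate displace the second duplicate, which is the kind of repetition the abstract says caption-based zero-shot methods tend to produce.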

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques