ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

Honglin Lin; Siyu Li; Guoshun Nan; Chaoyue Tang; Xueting Wang; Jingxin Xu; Rong Yankai; Zhili Zhou; Yutong Gao; Qimei Cui; Xiaofeng Tao

arXiv:2405.19226 · cs.CV · May 30, 2024

Open Access

TL;DR

ContextBLIP introduces a novel doubly contextual alignment approach for challenging image retrieval from complex textual descriptions, significantly improving alignment of subtle cues and outperforming existing models with fewer parameters.

Contribution

The paper proposes a simple yet effective method that combines intra- and inter-contextual alignment to improve image retrieval from linguistically complex descriptions. Its components are a multi-scale adapter, a matching loss, and a text-guided masking loss.

Findings

01. Achieves comparable results to GPT-4V with far fewer parameters.

02. Effectively highlights focal patches and aligns nuanced cues in both modalities.

03. Outperforms existing methods on benchmark datasets.

Abstract

Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of VLMs, they still significantly lag behind human performance in IRCD. The main challenges lie in aligning key contextual cues in two modalities, where these subtle cues are concealed in tiny areas of multiple contrastive images and within the complex linguistics of textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses enable iterative supervision for the adapter, gradually highlighting the focal patches of a single…
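The IRCD task setup described above (pick one image out of a set of minimally contrastive candidates given a text description) can be illustrated with a minimal sketch. This is not ContextBLIP itself: the embeddings are hand-made toy vectors, the names `cosine` and `retrieve` are hypothetical, and plain cosine similarity stands in for the paper's learned matching score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(text_vec, candidate_vecs):
    """Return (index of best-matching candidate, list of all scores).

    Assumption: each candidate image and the description have already been
    encoded into vectors by some model; the encoders are out of scope here.
    """
    scores = [cosine(text_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: three near-identical candidates, mimicking a minimally
# contrastive set in which only small differences separate the images.
text_vec = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.1, 0.9],
    [1.0, 0.0, 1.0],
    [0.9, 0.2, 1.0],
]

best, scores = retrieve(text_vec, candidates)
print("selected candidate:", best)
print("scores:", [round(s, 4) for s in scores])
```

In this toy set the scores differ only in the second or third decimal place, which shows why the task depends on fine-grained cues: a scorer that blurs small differences between candidates cannot separate them reliably.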

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training · Adapter