Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Hanwen Su; Ge Song; Kai Huang; Jiyan Wang; Ming Yang

arXiv:2407.00979·cs.CV·July 2, 2024

TL;DR

This paper introduces a zero-shot sketch-based image retrieval (ZS-SBIR) method that uses large-scale pre-trained language models (LLMs) to generate auxiliary text descriptions for images, with the aim of improving cross-modal alignment and retrieval performance.

Contribution

The approach uses LLM-generated text descriptions as auxiliary information for images, drawing on the zero-shot generalization ability that language offers for sketch-based image retrieval.

Findings

1. Outperforms state-of-the-art ZS-SBIR methods on three benchmarks.
2. Effectively leverages LLM-generated descriptions for cross-modal alignment.
3. Demonstrates superior zero-shot retrieval accuracy.

Abstract

In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). Prior methods tackle the problem in a two-modality setting, with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated extensive knowledge learned from web-scale data, gives us an opportunity to draw on collective textual information. Our key innovation lies in using text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for…
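As a rough illustration of the general idea named in the title (attention between image features and auxiliary text features, followed by sketch-to-image matching), the sketch below uses plain scaled dot-product cross-attention in standard-library Python. It is not the authors' network: the abstract is truncated here and does not specify the architecture, so the function names (`cross_attention`, `rank_gallery`), the residual fusion step, mean pooling, and cosine-similarity ranking are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_gallery(sketch_vec, gallery):
    """Rank gallery items by similarity to a sketch embedding.

    gallery is a list of (image_patch_features, text_token_features) pairs.
    Hypothetical fusion (an assumption, not from the paper): image patches
    attend to the text tokens, and the attended text context is added to
    each patch before pooling.
    """
    scores = []
    for patches, tokens in gallery:
        attended = cross_attention(patches, tokens, tokens)
        fused = [[p + a for p, a in zip(pr, ar)] for pr, ar in zip(patches, attended)]
        scores.append(cosine(sketch_vec, mean_pool(fused)))
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
```

In a real system the feature vectors would come from learned sketch, image, and text encoders; here they are just lists of floats, so the example only shows how text-conditioned image features could be compared against a sketch embedding.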

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need · ALIGN