Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

Siyuan Wang; Hanchen Gao; Guangming Zhu; Jiang Lu; Yiyue Ma; Tianci Wu; Jincai Huang; Liang Zhang

arXiv:2604.15735·cs.CV·April 20, 2026

TL;DR

This paper introduces STBIR, a framework that combines sketches and text to improve fine-grained image retrieval by leveraging their complementary features.

Contribution

It proposes a novel multi-module framework with curriculum learning, category-knowledge optimization, and cross-modal alignment, along with a new benchmark dataset.

Findings

01. STBIR outperforms existing methods in fine-grained retrieval tasks.

02. The curriculum learning module improves robustness to query quality.

03. The dataset supports future research in sketch-text image retrieval.

Abstract

Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. Hand-drawn sketches capture complex structural contours but lack color and texture; text supplies color and texture but omits spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By combining the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum-learning-driven robustness enhancement module is proposed to improve the model's handling of queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, which significantly boosts the model's representational power.…
