TL;DR
CIRThan is a new dataset for sketch+text composed image retrieval in Thangka art, highlighting challenges in aligning multimodal inputs with complex, domain-specific imagery.
Contribution
The paper introduces CIRThan, a culturally grounded dataset with hierarchical descriptions for Thangka images, and evaluates existing methods, exposing their limitations in this domain.
Findings
Existing CIR methods struggle with fine-grained, domain-specific Thangka images.
Hierarchical textual descriptions improve semantic understanding in retrieval.
Zero-shot methods perform poorly without in-domain supervision.
Abstract
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and…
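To make the sketch+text query setup concrete, here is a minimal late-fusion sketch in Python. It is a generic illustrative baseline, not the paper's method or any of the methods it evaluates. The field names `sketch_vec` and `text_vec`, the weight `alpha`, and the toy three-dimensional embeddings are all assumptions made for illustration; in practice the vectors would come from an encoder of your choice, and the text embedding could be computed from any one of the three description levels.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ComposedQuery:
    # Hypothetical schema: the dataset's actual field names are not given here.
    sketch_vec: list[float]  # embedding of the human-drawn sketch
    text_vec: list[float]    # embedding of one textual description level


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_gallery(query: ComposedQuery, gallery: dict[str, list[float]],
                 alpha: float = 0.5) -> list[str]:
    """Rank gallery image ids by a weighted sum of sketch and text similarity.

    alpha is an assumed fusion weight: 1.0 uses only the sketch, 0.0 only the text.
    """
    scores = {
        image_id: alpha * cosine(query.sketch_vec, vec)
        + (1 - alpha) * cosine(query.text_vec, vec)
        for image_id, vec in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy gallery embeddings (made up for illustration only).
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}
query = ComposedQuery(sketch_vec=[1.0, 0.1, 0.0], text_vec=[0.2, 1.0, 0.0])
ranking = rank_gallery(query, gallery)
```

With these toy vectors, `img_c` ranks first because it is the only gallery item that agrees with both the sketch and the text, which is the behavior a composed query is meant to reward.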
