An analysis of vision-language models for fabric retrieval
Francesco Giuliari, Asif Khan Pattan, Mohamed Lamine Mekhalfi, Fabio Poiesi

TL;DR
This paper explores how vision-language models can improve fabric retrieval by using structured descriptions generated via multimodal large language models, highlighting the importance of domain-specific adaptations.
Contribution
Introduces an automated annotation pipeline using MLLMs to generate structured descriptions for fabric retrieval, and evaluates multiple VLMs in a specialized domain.
Findings
Structured descriptions improve retrieval accuracy.
Meta's Perception Encoder outperforms CLIP and LAION-CLIP.
Zero-shot retrieval remains challenging in fine-grained domains.
Abstract
Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision-Language Models (VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three VLMs: CLIP, LAION-CLIP, and Meta's Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly…
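To make the zero-shot text-to-image retrieval setup concrete, here is a minimal sketch of how retrieval is commonly scored once a VLM has produced embeddings. It is not the authors' implementation: the toy vectors stand in for real model embeddings, the function names are hypothetical, and the use of cosine similarity with Recall@k as the metric is an assumption, since the summary above does not state which metric the paper reports.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_images(text_emb, image_embs):
    """Return image indices sorted by descending similarity to the text query."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_embs, image_embs, k):
    """Fraction of queries whose paired image (same index) appears in the top k."""
    hits = sum(
        1 for i, text_emb in enumerate(text_embs)
        if i in rank_images(text_emb, image_embs)[:k]
    )
    return hits / len(text_embs)

# Toy embeddings (assumption): in practice these would come from the image and
# text encoders of a model such as CLIP; description i is paired with image i.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.0, 0.5]]

print(recall_at_k(text_embs, image_embs, k=1))
print(recall_at_k(text_embs, image_embs, k=2))
```

In this toy setup the third description is closer to the first image than to its own, which is the kind of fine-grained confusion that a more discriminative description could resolve.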
Taxonomy
Methods: Contrastive Language-Image Pre-training
