An analysis of vision-language models for fabric retrieval

Francesco Giuliari; Asif Khan Pattan; Mohamed Lamine Mekhalfi; Fabio Poiesi

arXiv:2507.04735 · cs.CV · July 8, 2025

TL;DR

This paper explores how vision-language models can improve fabric retrieval by using structured descriptions generated via multimodal large language models, highlighting the importance of domain-specific adaptations.

Contribution

Introduces an automated annotation pipeline using MLLMs to generate structured descriptions for fabric retrieval, and evaluates multiple VLMs in a specialized domain.
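To make the distinction between the two description types concrete, here is a minimal sketch of how a single set of attributes could be rendered either as a structured string or as a freeform sentence. The attribute schema (`color`, `pattern`, `texture`) and both function names are hypothetical, chosen only for illustration; the paper's pipeline generates its descriptions with an MLLM, and its actual attribute fields are not listed here.

```python
from dataclasses import dataclass, asdict


@dataclass
class FabricAttributes:
    # Hypothetical schema: these fields are illustrative assumptions,
    # not the attribute set used in the paper.
    color: str
    pattern: str
    texture: str


def structured_description(attrs: FabricAttributes) -> str:
    """Render attributes as 'key: value' pairs in field order."""
    return "; ".join(f"{key}: {value}" for key, value in asdict(attrs).items())


def freeform_description(attrs: FabricAttributes) -> str:
    """Render the same attributes as one natural-language sentence."""
    return (
        f"A {attrs.color} fabric with a {attrs.pattern} pattern "
        f"and a {attrs.texture} texture."
    )


sample = FabricAttributes(color="navy blue", pattern="herringbone", texture="coarse")
print(structured_description(sample))
print(freeform_description(sample))
```

Either string could then be passed to a text encoder as the retrieval query; the structured form keeps every attribute explicit and in a fixed order.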

Findings

01. Structured, attribute-based descriptions improve retrieval accuracy.

02. Perception Encoder outperforms CLIP and LAION-CLIP.

03. Zero-shot retrieval remains challenging in fine-grained domains.

Abstract

Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision-Language Models (VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three VLMs: CLIP, LAION-CLIP, and Meta's Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly…
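Zero-shot text-to-image retrieval with a CLIP-style model typically reduces to ranking image embeddings by their similarity to a text embedding. The sketch below illustrates that ranking step and a recall@k score on toy vectors. It is not the paper's evaluation code: the embeddings are made-up numbers standing in for encoder outputs, the function names are hypothetical, and both the use of recall@k and the pairing of query i with image i are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the query."""
    scores = [(cosine(text_emb, emb), idx) for idx, emb in enumerate(image_embs)]
    return [idx for _, idx in sorted(scores, key=lambda s: -s[0])]


def recall_at_k(text_embs, image_embs, k):
    """Fraction of queries whose matching image (same index) is in the top k."""
    hits = sum(
        1
        for query_idx, text_emb in enumerate(text_embs)
        if query_idx in rank_images(text_emb, image_embs)[:k]
    )
    return hits / len(text_embs)


# Toy embeddings (assumed values, standing in for real encoder outputs).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.1, 0.9]]

print(rank_images(text_embs[1], image_embs))
print(recall_at_k(text_embs, image_embs, k=1))
print(recall_at_k(text_embs, image_embs, k=2))
```

In this toy setup the second query lands closer to the wrong image, so it counts as a miss at k=1 and a hit at k=2, which is the kind of near-miss that fine-grained domains tend to produce.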

Taxonomy

Methods: Contrastive Language-Image Pre-training