Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis
Shubham Shukla, Kunal Sonalkar

TL;DR
This study evaluates the zero-shot performance of GPT-4o-mini and Gemini 2.0 Flash on fine-grained fashion attribute recognition using images, revealing Gemini 2.0 Flash's superior performance and highlighting the need for domain-specific fine-tuning.
Contribution
The paper provides the first zero-shot assessment of GPT-4o-mini and Gemini 2.0 Flash, two models chosen for their balance of performance, speed, and cost, on fine-grained fashion attribute recognition, using the DeepFashion-MultiModal dataset.
Findings
Gemini 2.0 Flash achieves a macro F1 score of 56.79%.
GPT-4o-mini achieves a macro F1 score of 43.28%.
Performance varies across different attribute categories.
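For readers unfamiliar with the metric, macro F1 is the unweighted mean of per-class F1 scores, so rare attribute values count as much as common ones. The sketch below shows the standard single-label computation on made-up labels; the function name and the example values are illustrative assumptions, and the paper may aggregate across attribute categories differently.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical fabric labels, not taken from the paper's data.
truth = ["cotton", "denim", "cotton", "leather"]
preds = ["cotton", "cotton", "cotton", "denim"]
print(round(macro_f1(truth, preds), 4))  # 0.2667
```

In this toy case the model gets every "cotton" item right (F1 of 0.8) but misses "denim" and "leather" entirely, so the macro average falls to about 0.27 even though half the predictions are correct.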
Abstract
The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the 'discovery experience' of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal)…
