Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Anushree Berlia

arXiv:2605.09827 · cs.CV · May 12, 2026

TL;DR

Fashion Florence is a Florence-2 model fine-tuned to extract structured fashion attributes from clothing images as JSON. It reaches 94.6% category accuracy and a 99.8% valid JSON output rate, making it an efficient source of attributes for recommendation and retrieval applications.

Contribution

The paper introduces Fashion Florence, a Florence-2 model fine-tuned with LoRA to extract structured fashion attributes from images, achieving 94.6% category accuracy and a 99.8% valid JSON output rate.

Findings

01  94.6% category accuracy on the test set
02  99.8% valid JSON output rate
03  Style tag F1 score of 0.753
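
The three findings above correspond to three kinds of metric: a parse-success rate, a single-label accuracy, and a multi-label F1. The sketch below shows one way such numbers could be computed from raw model outputs. It is not the paper's evaluation code: the key names `category` and `style_tags`, the function names, and the choice of micro-averaged F1 are all assumptions made for illustration, since the listing does not state how the F1 score is averaged.

```python
import json


def parse_or_none(text):
    """Return the decoded JSON object, or None if text is not a valid JSON object."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    return obj if isinstance(obj, dict) else None


def evaluate(outputs, references):
    """Compute valid-JSON rate, category accuracy, and micro-averaged style-tag F1.

    outputs:    list of raw model output strings
    references: list of dicts with (assumed) keys "category" and "style_tags"
    """
    if not outputs or len(outputs) != len(references):
        raise ValueError("outputs and references must be non-empty and equal length")

    parsed = [parse_or_none(o) for o in outputs]
    n = len(outputs)
    valid = sum(p is not None for p in parsed)

    correct = 0
    tp = fp = fn = 0
    for p, ref in zip(parsed, references):
        pred = p or {}  # an unparseable output counts as predicting nothing
        if pred.get("category") == ref["category"]:
            correct += 1
        raw = pred.get("style_tags", [])
        pred_tags = {t for t in raw if isinstance(t, str)} if isinstance(raw, list) else set()
        ref_tags = set(ref["style_tags"])
        tp += len(pred_tags & ref_tags)
        fp += len(pred_tags - ref_tags)
        fn += len(ref_tags - pred_tags)

    denom = 2 * tp + fp + fn
    return {
        "valid_json_rate": valid / n,
        "category_accuracy": correct / n,
        "style_f1": (2 * tp / denom) if denom else 0.0,
    }
```

In this sketch an invalid output is scored as a wrong category and an empty tag set, so format failures lower the accuracy and F1 figures rather than being excluded from them. Whether the paper treats invalid outputs the same way is not stated here.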

Abstract

We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags. This structured output is suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), whose fine-grained annotations we collapse into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash.…
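
The rule-based label engineering described in the abstract can be pictured as a lookup from fine-grained labels to a small fixed vocabulary, followed by serialization to the JSON string the model is trained to emit. The sketch below is a minimal illustration under stated assumptions: the rule tables, label strings, key names, and function names are hypothetical and are not taken from the paper or from the iMaterialist label set, and only two of the five output fields are shown.

```python
import json

# Hypothetical rules for illustration only; the paper's actual mapping
# from the 228 source labels to its compact schema is not given here.
CATEGORY_RULES = {
    "t-shirt": "top",
    "blouse": "top",
    "jeans": "bottom",
    "skirt": "bottom",
    "maxi dress": "dress",
}
COLOR_RULES = {
    "navy": "blue",
    "teal": "blue",
    "burgundy": "red",
    "crimson": "red",
}


def collapse(labels):
    """Map fine-grained labels to a compact record.

    Returns None when no category rule fires, so the example can be
    dropped from the fine-tuning set instead of receiving a guessed label.
    """
    category = next((CATEGORY_RULES[l] for l in labels if l in CATEGORY_RULES), None)
    if category is None:
        return None
    color = next((COLOR_RULES[l] for l in labels if l in COLOR_RULES), "unknown")
    return {"category": category, "color": color}


def to_target(labels):
    """Serialize the collapsed record as the JSON training target string."""
    record = collapse(labels)
    return None if record is None else json.dumps(record, sort_keys=True)
```

Sorting the keys keeps the target string deterministic across examples, which is one plausible way to make a generative model's output format easier to learn. The paper may order or name its fields differently.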
