GLAMI-1M: A Multilingual Image-Text Fashion Dataset
Vaclav Kosar, Antonín Hoskovec, Milan Šulc, Radek Bartyzal

TL;DR
GLAMI-1M is the largest multilingual image-text classification dataset: fashion product images with descriptions in 13 languages, labeled into 191 categories, suitable for both classification and text-conditioned image generation.
Contribution
This paper introduces GLAMI-1M, a large-scale, multilingual fashion dataset with high-quality annotations, and provides baseline models for classification and image generation.
Findings
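The best model, EmbraceNet using both visual and textual features, achieves 69.7% classification accuracy
Experiments with a modified Imagen model show the dataset is suitable for text-conditioned image generation
All 100k test images and 75% of the 1M training images are human-labeled, supporting fine-grained classification into 191 classes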
Abstract
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in one of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: the best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at https://github.com/glami/glami-1m
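The accuracy figure quoted in the abstract is a fraction of correctly classified test items. As a minimal sketch of how such a top-1 accuracy could be computed, assuming predictions and labels are given as integer category IDs, the following may help. The function name `top1_accuracy` and the sample IDs are hypothetical illustrations, not part of the GLAMI-1M code or data.

```python
def top1_accuracy(predicted, actual):
    """Return the fraction of items whose predicted category equals the labeled one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical category IDs in the range 0..190 (191 classes); not real dataset values.
predicted = [12, 7, 190, 3, 3]
actual = [12, 8, 190, 3, 5]

accuracy = top1_accuracy(predicted, actual)
print(f"top-1 accuracy: {accuracy:.1%}")
```

With 191 classes, a single number like this hides per-category differences, so a per-class breakdown may be worth computing alongside it.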
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Dense Connections · Layer Normalization · Linear Layer · Dropout
