From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan

TL;DR
This paper presents a retrieval-augmented framework that combines multi-garment detection, attribute reasoning, and large language models to generate accurate, stylistically rich fashion captions and hashtags, improving factual grounding and domain generalization.
Contribution
It introduces a novel retrieval-augmented pipeline for fashion captioning and hashtagging that enhances attribute fidelity and interpretability over traditional end-to-end models.
Findings
YOLO detector achieves mAP@0.5 of 0.71 for garment detection
RAG-LLM pipeline attains 0.80 attribute coverage in captioning
Retrieval-augmented approach reduces hallucination and improves factual grounding
Abstract
This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically engaging text for fashion imagery, overcoming the limitations of end-to-end captioners, which struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module that infers fabric and gender attributes from a structured product index. These attributes, together with retrieved style examples, form a factual evidence pack that guides an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used…
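The dominant color step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the garment crop has already been reduced to a list of RGB tuples, and the function names `dominant_color` and `_assign`, the default `k=3`, and the seeded initialization are all hypothetical choices made for illustration. It uses plain Python so that it runs without extra dependencies; a real pipeline would more likely use an optimized library.

```python
import random


def _assign(pixels, centroids):
    """Group pixels by their nearest centroid (squared Euclidean distance in RGB)."""
    clusters = [[] for _ in centroids]
    for p in pixels:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        clusters[nearest].append(p)
    return clusters


def dominant_color(pixels, k=3, iters=20, seed=0):
    """Return the rounded (r, g, b) centroid of the largest k-means cluster.

    `pixels` is a list of (r, g, b) tuples, e.g. taken from a garment crop.
    All parameter defaults here are illustrative assumptions.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")

    # Initialise centroids from distinct colors so no two start identical.
    distinct = sorted(set(pixels))
    k = min(k, len(distinct))
    centroids = random.Random(seed).sample(distinct, k)

    for _ in range(iters):
        clusters = _assign(pixels, centroids)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                # New centroid is the per-channel mean of the cluster members.
                updated.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(centroids[i])
        if updated == centroids:
            break
        centroids = updated

    # Final assignment against the final centroids, then pick the largest cluster.
    clusters = _assign(pixels, centroids)
    largest = max(range(k), key=lambda i: len(clusters[i]))
    return tuple(round(c) for c in centroids[largest])


if __name__ == "__main__":
    # Synthetic "crop": mostly red, some blue, a little near-white background.
    crop = [(250, 10, 10)] * 60 + [(10, 10, 240)] * 30 + [(240, 240, 240)] * 10
    print(dominant_color(crop, k=3))
```

Picking the largest cluster rather than the mean of all pixels keeps a small amount of background or skin from shifting the reported garment color, which is presumably why clustering is preferred over simple averaging for this step.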
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
