From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

Moazzam Umer Gondal; Hamad Ul Qudous; Daniya Siddiqui; Asma Ahmad Farhan

arXiv:2511.19149 · cs.CV · November 25, 2025

Open Access

TL;DR

This paper presents a retrieval-augmented framework that combines multi-garment detection, attribute reasoning, and large language models to generate accurate, stylistically rich fashion captions and hashtags, improving factual grounding and domain generalization.

Contribution

It introduces a novel retrieval-augmented pipeline for fashion captioning and hashtagging that enhances attribute fidelity and interpretability over traditional end-to-end models.

Findings

01

YOLO detector achieves mAP@0.5 of 0.71 for garment detection

02

RAG-LLM pipeline attains 0.80 attribute coverage in captioning

03

Retrieval-augmented approach reduces hallucination and improves factual grounding

Abstract

This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation that combines multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically engaging text for fashion imagery, addressing the limitations of end-to-end captioners, which struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module that infers fabric and gender attributes from a structured product index. These attributes, together with retrieved style examples, form a factual evidence pack that guides an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used…
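The dominant color step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the pixels of a detected garment crop are already available as a list of RGB tuples, and it uses a small pure-Python k-means rather than whatever library the authors used. The function name `dominant_colors` and its parameters are hypothetical and do not come from the paper.

```python
import random


def dominant_colors(pixels, k=3, iters=10, seed=0):
    """Cluster RGB pixels with plain k-means.

    Returns a list of ((r, g, b), share) pairs, largest cluster first.
    Illustrative only: names and defaults are assumptions, not the paper's code.
    """
    rng = random.Random(seed)
    # Seed the centers from distinct colors so no two centers start identical.
    distinct = sorted(set(pixels))
    centers = rng.sample(distinct, min(k, len(distinct)))
    clusters = []

    for _ in range(iters):
        # Assignment step: each pixel joins its nearest center (squared distance).
        clusters = [[] for _ in centers]
        for p in pixels:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(ch) / len(c) for ch in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]

    total = len(pixels)
    ranked = sorted(zip(centers, clusters), key=lambda cc: len(cc[1]), reverse=True)
    return [(tuple(round(v) for v in c), len(members) / total) for c, members in ranked]


# Toy crop: 70 red-ish pixels and 30 blue-ish pixels.
crop = [(200, 10, 10)] * 70 + [(10, 10, 200)] * 30
palette = dominant_colors(crop, k=2)
print(palette)
```

In a pipeline like the one described, the largest cluster center would then be mapped to a color name before being added to the evidence pack. That mapping is not specified in the abstract and is left out here.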

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis