FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang, Ata Mahjoubfar, Anupama Joshi

TL;DR
This paper introduces a large-scale, domain-specific visual question answering system for fashion images. It pairs an automatically generated dataset of 168 million samples, drawn from 207 thousand images, with transformer models, and the best model surpasses human expert accuracy in answering complex natural language questions about apparel.
Contribution
The paper presents a novel approach to creating a large, challenging fashion VQA dataset and demonstrates that transformer-based models trained from scratch can outperform existing methods and even human experts.
Findings
Transformer models achieve maximum accuracy when the same transformer both encodes the question and decodes the answer.
The best model surpasses human expert performance on the dataset.
A large-scale, domain-specific dataset enables training of specialized visual language models.
Abstract
Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural language; this is particularly true for systems specialized in visually-dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to the successful training of our VQA model is the automatic creation of a visual question-answering dataset with 168 million samples from item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that considers the difficulty of the question-answer pairs to emphasize challenging concepts.…
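The template-based sample generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the templates, the attribute names, the `DIFFICULTY` weights, and the helper functions `generate_pairs` and `sample_pairs` are all hypothetical, chosen only to show how item attributes might be expanded into question-answer pairs and then sampled so that harder concepts appear more often.

```python
import random

# Hypothetical templates keyed by attribute name; the paper's actual
# templates are not reproduced here.
TEMPLATES = {
    "color": [
        ("What color is the {category}?", "{value}"),
        ("Is the {category} {value}?", "yes"),
    ],
    "sleeve_length": [
        ("What is the sleeve length of the {category}?", "{value}"),
    ],
}

# Hypothetical difficulty weights: a larger weight means the attribute is
# assumed to be harder and is therefore sampled more often.
DIFFICULTY = {"color": 1.0, "sleeve_length": 3.0}


def generate_pairs(item):
    """Expand one item's attributes into (question, answer, attribute) triples."""
    pairs = []
    for attr, value in item["attributes"].items():
        for q_tmpl, a_tmpl in TEMPLATES.get(attr, []):
            question = q_tmpl.format(category=item["category"], value=value)
            answer = a_tmpl.format(category=item["category"], value=value)
            pairs.append((question, answer, attr))
    return pairs


def sample_pairs(pairs, k, rng):
    """Draw k pairs with replacement, weighted by the assumed difficulty."""
    weights = [DIFFICULTY.get(attr, 1.0) for _, _, attr in pairs]
    return rng.choices(pairs, weights=weights, k=k)


# Example item with made-up attributes.
item = {"category": "dress", "attributes": {"color": "red", "sleeve_length": "short"}}
pairs = generate_pairs(item)
batch = sample_pairs(pairs, k=5, rng=random.Random(0))
```

Weighted sampling with replacement is only one way to emphasize challenging concepts. The abstract does not specify how difficulty is measured or applied, so the weights here are placeholders.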
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
