A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
Nayan Kumar Singh

TL;DR
This paper introduces a multi-task, hierarchical vision-to-text system for generating accurate e-commerce product descriptions from images, reducing factual errors and improving efficiency over existing models.
Contribution
It proposes a novel multi-task learning architecture with hierarchical generation, enhancing factual accuracy and reducing latency in image-to-text e-commerce listings.
Findings
Hierarchical generation reduces factual hallucinations by 44.5%.
Multi-task learning improves attribute classification by 6.6%.
The model is 3.5 times faster than comparable vision-to-language models.
Abstract
Manually writing catchy product descriptions and names is labor-intensive and slow for retailers. Although generative AI offers an automation solution in the form of Vision-to-Language Models (VLMs), current VLMs are prone to factual "hallucinations". Siloed, single-task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. This study makes two contributions to the model architecture. First, it applies a multi-task learning approach to fine-tuning a vision encoder, in which a single vision backbone is jointly trained on attribute prediction (such as color, hemline, and neck style) and price regression. Second, it introduces a hierarchical generation process where the model's…
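The joint training objective described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the use of cross-entropy per attribute head, squared error for price, and the `price_weight` balancing factor are all assumptions made for illustration, and the logits stand in for the outputs of hypothetical task heads on a shared vision backbone.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the correct class for one attribute head."""
    return -math.log(softmax(logits)[target_index])

def multitask_loss(attr_logits, attr_targets, price_pred, price_target,
                   price_weight=1.0):
    """Combine per-attribute classification losses with a price regression loss.

    attr_logits:  dict mapping attribute name -> list of class logits
    attr_targets: dict mapping attribute name -> index of the true class
    price_weight: hypothetical factor balancing regression against classification
    """
    cls_loss = sum(cross_entropy(attr_logits[name], attr_targets[name])
                   for name in attr_logits)
    reg_loss = (price_pred - price_target) ** 2  # squared error (assumed)
    return cls_loss + price_weight * reg_loss

# Example with made-up head outputs for a single image.
logits = {
    "color": [2.0, 0.1, -1.0],
    "hemline": [0.3, 1.2],
    "neck_style": [0.0, 0.5, 1.5, -0.5],
}
targets = {"color": 0, "hemline": 1, "neck_style": 2}
loss = multitask_loss(logits, targets, price_pred=0.42, price_target=0.50)
print(f"joint loss: {loss:.4f}")
```

Because every term is computed from the same image, minimizing the summed loss pushes gradients from all tasks into the shared encoder, which is the usual motivation for training one backbone rather than separate single-task models.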
