Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail

Bianca Lamm; Janis Keuper

arXiv:2408.15626·cs.CV·August 29, 2024

Open Access

TL;DR

This study evaluates whether pre-trained vision language models can replace traditional OCR-based pipelines for visual question answering in retail, highlighting their strengths and limitations in real-world scenarios.

Contribution

The paper provides a comprehensive analysis of VLMs' performance in retail VQA tasks, comparing commercial and open-source models in a production context.

Findings

01. VLMs perform well on brand and price questions

02. VLMs struggle with fine-grained product identification

03. Performance varies significantly depending on the task

Abstract

Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps, including image pre-processing, object and text detection, Optical Character Recognition (OCR), and (mostly supervised) object classification. However, recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four…

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Web Data Mining and Analysis