Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Soham Joshi; Shwet Kamal Mishra; Viswanath Gopalakrishnan

arXiv:2511.02046·cs.CV·November 5, 2025

TL;DR

This paper introduces an automated pipeline that synthesizes large-scale text-VQA datasets by leveraging multimodal foundation models, OCR, and question generation, significantly reducing manual annotation efforts.

Contribution

The paper presents what the authors describe as the first end-to-end pipeline for automatically synthesizing and validating a large-scale text-VQA dataset using multiple AI components.

Findings

01 Generated 72K QA pairs from 44K images

02 Automated validation ensures data quality

03 Scales efficiently with scene text data

Abstract

Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene text from a given image. We propose a pipeline for the automated synthesis of text-VQA datasets that produces faithful QA pairs and scales with the availability of scene-text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline…
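The stages named in the abstract (text spotting, ROI detection, caption generation, question generation) can be pictured as a chain of interchangeable components. The following is a minimal sketch of that composition, not the authors' implementation: every name in it (`TextInstance`, `QAPair`, `synthesize_qa`, the four callables, and the box-center rule for assigning text to an ROI) is a hypothetical placeholder, and the real models behind each stage are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class TextInstance:
    text: str  # recognized string from the OCR stage
    box: Box   # location of the string in the image


@dataclass
class QAPair:
    question: str
    answer: str


def _center_inside(inner: Box, outer: Box) -> bool:
    """Return True if the center of `inner` falls within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def synthesize_qa(
    image: object,
    spot_text: Callable[[object], List[TextInstance]],
    detect_rois: Callable[[object], List[Box]],
    caption_roi: Callable[[object, Box], str],
    generate_question: Callable[[str, str], str],
) -> List[QAPair]:
    """Chain the four stages; each callable is an assumed stand-in for a model."""
    texts = spot_text(image)
    pairs: List[QAPair] = []
    for roi in detect_rois(image):
        # Assumption: a text instance belongs to an ROI if its center lies inside it.
        in_roi = [t for t in texts if _center_inside(t.box, roi)]
        if not in_roi:
            continue
        caption = caption_roi(image, roi)
        for t in in_roi:
            question = generate_question(caption, t.text)
            # Placeholder filter only; the paper's validation step is not reproduced here.
            if question.strip() and t.text.strip():
                pairs.append(QAPair(question=question, answer=t.text))
    return pairs
```

Passing each stage as a callable keeps the sketch independent of any particular OCR engine or multimodal model, so a component can be swapped without touching the control flow.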

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques