SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Xin Su; Man Luo; Kris W Pan; Tien Pei Chou; Vasudev Lal; Phillip Howard

arXiv:2406.19593 · cs.CL · June 11, 2025

TL;DR

SK-VQA introduces a large-scale synthetic multimodal dataset with over 2 million visual question-answer pairs, enhancing training and evaluation of context-augmented multimodal language models for knowledge-based visual question answering.

Contribution

The paper presents SK-VQA, a novel synthetic dataset that significantly expands the scale, diversity, and domain coverage for training and benchmarking context-augmented multimodal LLMs.

Findings

01 Models trained on SK-VQA show improved generalization in context-aware VQA.

02 SK-VQA serves as both a challenging benchmark and an effective training resource.

03 Human evaluation confirms the high quality and contextual relevance of the dataset.

Abstract

Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where external knowledge is needed to answer a question. However, existing multimodal LLMs (MLLMs) are not designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training MLLMs, its application to context-augmented generation remains underexplored. To address this gap, we introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with context documents containing information necessary to determine the final answer. Compared to previous datasets, SK-VQA contains 11x more unique questions, exhibits greater domain diversity, and covers a broader spectrum of image sources. Through…
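
As an illustration of what "context-augmented" means here, the sketch below models one example as an image, a context document, a question, and an answer, and assembles the text portion of a prompt from it. The class name, field names, and prompt layout are assumptions made for illustration; they are not the dataset's actual schema or the authors' prompt format.

```python
from dataclasses import dataclass


@dataclass
class ContextAugmentedVQAExample:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str   # path or URL of the image the question refers to
    context: str      # document holding the knowledge needed for the answer
    question: str     # question about the image that requires the context
    answer: str       # reference answer


def build_prompt(example: ContextAugmentedVQAExample) -> str:
    """Assemble the text part of a context-augmented prompt.

    The image itself would be passed to the multimodal model separately;
    this function only formats the accompanying text.
    """
    return (
        f"Context: {example.context}\n"
        f"Question: {example.question}\n"
        "Answer:"
    )


# Made-up example used only to show the shape of the data.
example = ContextAugmentedVQAExample(
    image_path="images/lighthouse.jpg",
    context="The lighthouse in the photo was completed in 1870.",
    question="In what year was the structure in the image completed?",
    answer="1870",
)
prompt = build_prompt(example)
```

The point of the layout is that the answer cannot be read off the image alone: the model has to combine what it sees with the supplied context document.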

Code & Models

Datasets

Intel/SK-VQA
dataset · 372 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Softmax · Attention Is All You Need