SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard

TL;DR
SK-VQA introduces a large-scale synthetic multimodal dataset with over 2 million visual question-answer pairs, enhancing training and evaluation of context-augmented multimodal language models for knowledge-based visual question answering.
Contribution
The paper presents SK-VQA, a novel synthetic dataset that significantly expands the scale, diversity, and domain coverage for training and benchmarking context-augmented multimodal LLMs.
Findings
Models trained on SK-VQA show improved generalization in context-aware VQA.
SK-VQA serves as both a challenging benchmark and an effective training resource.
Human evaluation confirms the high quality and contextual relevance of the dataset.
Abstract
Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where external knowledge is needed to answer a question. However, existing multimodal LLMs (MLLMs) are not designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training MLLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with context documents containing information necessary to determine the final answer. Compared to previous datasets, SK-VQA contains 11x more unique questions, exhibits greater domain diversity, and covers a broader spectrum of image sources. Through…
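To make the abstract's description of the data concrete, here is a minimal sketch of what one context-augmented example could look like: an image, a context document, a question, and an answer. The class name `ContextVQAExample`, its field names, the `build_prompt` helper, and the prompt wording are all assumptions made for illustration; they are not the dataset's actual schema or the authors' prompt format.

```python
from dataclasses import dataclass


@dataclass
class ContextVQAExample:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str  # path to the image the question refers to
    context: str     # context document holding the knowledge needed to answer
    question: str
    answer: str


def build_prompt(example: ContextVQAExample) -> str:
    """Place the context document ahead of the question (assumed prompt layout)."""
    return (
        f"Context: {example.context}\n"
        f"Question: {example.question}\n"
        "Answer:"
    )


# Placeholder values only; no real dataset content is reproduced here.
example = ContextVQAExample(
    image_path="images/placeholder.jpg",
    context="Placeholder context document describing the pictured subject.",
    question="Placeholder question about the image?",
    answer="placeholder answer",
)
prompt = build_prompt(example)
print(prompt)
```

In a real pipeline the image would be passed to the multimodal model alongside this text; the sketch covers only the text side.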
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Softmax · Attention Is All You Need
