A Free Lunch in Generating Datasets: Building a VQG and VQA System with Attention and Humans in the Loop
Jihyeon Lee, Sho Arora

TL;DR
This paper introduces a cost-effective method for building large-scale VQA datasets by generating questions with models, asking social media users, and parsing their responses, reducing reliance on traditional annotation efforts.
Contribution
It proposes a novel system that combines visual question generation with human-in-the-loop data collection, enabling scalable and inexpensive dataset expansion.
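The generate, ask, and parse loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Every name here (`VQAExample`, `collect_examples`, and the three `toy_*` functions) is hypothetical, and the toy keyword matcher merely stands in for the learned question-generation and answer-parsing models the summary mentions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VQAExample:
    # Hypothetical record type: one (image, question, answer) triple.
    image_id: str
    question: str
    answer: str


def collect_examples(
    image_ids: List[str],
    generate_question: Callable[[str], str],
    ask_humans: Callable[[str, str], List[str]],
    parse_answer: Callable[[str, str], Optional[str]],
) -> List[VQAExample]:
    """Run one pass of the generate -> ask -> parse loop over a batch of images."""
    dataset = []
    for image_id in image_ids:
        question = generate_question(image_id)        # stand-in for a VQG model
        for reply in ask_humans(image_id, question):  # stand-in for asking social media users
            answer = parse_answer(question, reply)    # stand-in for an answer-parsing model
            if answer is not None:                    # drop replies with no usable answer
                dataset.append(VQAExample(image_id, question, answer))
    return dataset


# Toy stand-ins so the sketch runs end to end (assumptions, not the paper's models).
def toy_generate_question(image_id: str) -> str:
    return "What animal is this?"


def toy_ask_humans(image_id: str, question: str) -> List[str]:
    return ["omg that's my dog!!", "lol"]


def toy_parse_answer(question: str, reply: str) -> Optional[str]:
    known = {"dog", "cat", "bird"}
    for token in reply.lower().replace("!", " ").split():
        if token in known:
            return token
    return None


examples = collect_examples(
    ["img_001"], toy_generate_question, toy_ask_humans, toy_parse_answer
)
```

Passing the three stages in as callables keeps the loop independent of any particular model or platform, so expanding the dataset amounts to calling `collect_examples` on more images.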
Findings
Models effectively parse clean answers from noisy human responses.
System collects large datasets at minimal cost.
Potential to improve VQA performance with scalable data gathering.
Abstract
Despite their importance in training artificial intelligence systems, large datasets remain challenging to acquire. For example, the ImageNet dataset required fourteen million labels of basic human knowledge, such as whether an image contains a chair. Unfortunately, this knowledge is so simple that labeling it is tedious for human annotators, yet tacit enough that human annotators remain necessary. However, human collaborative efforts for tasks like labeling massive amounts of data are costly, inconsistent, and prone to failure, and this method does not resolve the issue of the resulting dataset being static in nature. What if we asked people questions they want to answer and collected their responses as data? This would mean we could gather data at a much lower cost, and expanding a dataset would simply become a matter of asking more questions. We focus on the task of Visual Question Answering (VQA)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
