TL;DR
This paper introduces Visual Question Generation (VQG), a new task where systems generate natural questions about images, emphasizing commonsense inference and abstract events, supported by new datasets and baseline models.
Contribution
The paper proposes the novel VQG task, provides three datasets spanning object-centric to event-centric images, and evaluates generative and retrieval models, highlighting the gap between model and human performance and encouraging further research.
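To illustrate what a retrieval model for this task might look like, here is a minimal sketch that returns the question attached to the most similar training image. It is not the paper's implementation: the feature vectors, the toy questions, the choice of cosine similarity, and the names `cosine`, `retrieve_question`, and `toy_training_set` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def retrieve_question(query_features, training_set):
    """Return the question paired with the nearest training image.

    training_set is a list of (feature_vector, question) pairs.
    Hypothetical helper: a 1-nearest-neighbor lookup, assumed here
    as one simple form a retrieval baseline could take.
    """
    best = max(training_set, key=lambda item: cosine(query_features, item[0]))
    return best[1]

# Toy data, invented for illustration only (not from the paper's datasets).
toy_training_set = [
    ([1.0, 0.0, 0.0], "What breed is the dog?"),
    ([0.0, 1.0, 0.0], "Who won the race?"),
    ([0.0, 0.0, 1.0], "What caused the accident?"),
]

if __name__ == "__main__":
    print(retrieve_question([0.1, 0.9, 0.0], toy_training_set))
```

In practice the feature vectors would come from an image encoder rather than being written by hand. A generative model would instead produce a new question token by token.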
Findings
Models generate reasonable questions but lag behind humans.
Three new datasets cover images ranging from object-centric to event-centric.
Significant potential for integrating commonsense knowledge.
Abstract
There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation…
