What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, Patrick Gallinari

TL;DR
This paper evaluates BERT's innate ability to perform visual question generation without additional pre-training, demonstrating its adaptability to multi-modal data and its effectiveness in generating questions from visual and textual inputs.
Contribution
The paper introduces BERT-gen, a BERT-based architecture for text generation that works with mono- or multi-modal data without requiring extensive pre-training.
Findings
BERT-gen can effectively generate questions from visual and textual inputs.
The model shows strong performance on VQG datasets with limited data.
BERT's pre-trained knowledge can be leveraged for multi-modal tasks without additional pre-training.
Abstract
Pre-trained language models have recently contributed to significant advances in NLP tasks. Multi-modal versions of BERT have since been developed, using heavy pre-training that relies on vast corpora of aligned textual and image data, and have primarily been applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, by avoiding pre-training on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, which makes it possible to study the impact of each modality (as input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation, since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations. The results reported…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax
