Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks

TL;DR
This paper introduces two weakly-supervised pretraining methods that use small-scale visual relation data to improve multimodal representations, and shows their effectiveness in zero-shot evaluations on coarse-grained and fine-grained vision-and-language tasks.
Contribution
The paper proposes two approaches, verbalised scene graphs and masked relation prediction, that incorporate weak relation supervision into multimodal pretraining.
Findings
Improved zero-shot performance on both coarse-grained and fine-grained tasks.
Effective use of small-scale relation data enhances multimodal representations.
Applied to strong baselines pretrained on large amounts of Web data, the methods improve zero-shot results over those baselines.
Abstract
Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised…
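The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the caption template (clauses joined by ". "), the `(x0, y0, x1, y1)` box convention, and the square patch grid are all assumptions made for illustration. The first function turns relation triplets into a structured caption; the second shows one plausible reading of "visually masked contexts", where only patches overlapping the subject and object boxes stay visible.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]   # (subject, predicate, object)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; hypothetical convention


def verbalise_scene_graph(triplets: List[Triplet], sep: str = ". ") -> str:
    """Join relation triplets into one structured caption string.

    The "subject predicate object" template and the separator are assumptions.
    """
    clauses = [f"{s} {p} {o}" for s, p, o in triplets]
    return sep.join(clauses)


def visible_patches(boxes: List[Box], image_size: int, patch_size: int) -> List[List[bool]]:
    """Return a grid where True marks patches overlapping any given box.

    Patches left as False would be treated as masked visual context.
    Assumes a square image split into square, non-overlapping patches.
    """
    n = image_size // patch_size
    keep = [[False] * n for _ in range(n)]
    for x0, y0, x1, y1 in boxes:
        for row in range(n):
            for col in range(n):
                px0, py0 = col * patch_size, row * patch_size
                px1, py1 = px0 + patch_size, py0 + patch_size
                # Standard axis-aligned rectangle overlap test.
                if px0 < x1 and x0 < px1 and py0 < y1 and y0 < py1:
                    keep[row][col] = True
    return keep


if __name__ == "__main__":
    graph = [("man", "riding", "horse"), ("horse", "on", "grass")]
    print(verbalise_scene_graph(graph))
    # Subject and object boxes for one triplet on a 32x32 image with 16x16 patches.
    print(visible_patches([(0, 0, 10, 10)], image_size=32, patch_size=16))
```

In a pretraining pipeline, the verbalised string would be paired with the image as an extra caption, and the patch grid would decide which image regions the model sees when asked to predict the relation. Both steps are simplified here.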
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
