Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Emanuele Bugliarello; Aida Nematzadeh; Lisa Anne Hendricks

arXiv:2305.14281 · cs.CL · October 20, 2023 · 1 citation

TL;DR

This paper introduces two weakly-supervised pretraining methods that leverage small-scale visual relation data to improve multimodal representations in vision-and-language tasks, demonstrating effectiveness in zero-shot evaluations.

Contribution

The paper proposes two approaches, verbalised scene graphs and masked relation prediction, to incorporate weak relation supervision into multimodal pretraining.

Findings

01 Improved zero-shot performance on both coarse-grained and fine-grained tasks.

02 Small-scale visual relation data is enough to improve multimodal representations.

03 Both methods improve on strong baselines pretrained on large amounts of Web data.

Abstract

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised…
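The idea of verbalised scene graphs described in the abstract, turning relation triplets into a structured caption, can be illustrated with a minimal sketch. The function name `verbalise_scene_graph`, the "subject predicate object" template, and the ". " separator are assumptions made for illustration; the abstract does not specify the exact template or ordering the authors use.

```python
from typing import List, Tuple

# A visual relation triplet: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def verbalise_scene_graph(triplets: List[Triplet], sep: str = ". ") -> str:
    """Join relation triplets into one structured caption.

    Hypothetical helper: the template and separator are illustrative
    assumptions, not the paper's confirmed format.
    """
    # Render each triplet as a short "subject predicate object" phrase.
    phrases = [f"{subj} {pred} {obj}" for subj, pred, obj in triplets]
    # Concatenate the phrases so the result can be used like a caption.
    return sep.join(phrases)


if __name__ == "__main__":
    example = [("man", "riding", "horse"), ("horse", "on", "beach")]
    print(verbalise_scene_graph(example))
```

The resulting string could then be paired with its image and treated as an additional image description alongside ordinary captions, which is how the abstract describes the use of these structured captions.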

Code & Models

Repositories

e-bug/weak-relation-vlm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition