Finding Structural Knowledge in Multimodal-BERT

Victor Milewski; Miryam de Lhoneux; Marie-Francine Moens

arXiv:2203.09306 · cs.CL · March 18, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

This paper investigates whether the embeddings of multimodal-BERT models encode the grammatical structure of text and the structure over objects in images. Probing finds that, despite their multimodal training, the models do not encode the scene trees that make these structures explicit.

Contribution

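The study makes the structure of language and visual data explicit, using dependency parses of the image descriptions and scene trees over the object regions, and probes multimodal-BERT models to assess whether their embeddings encode these structures. The probes show that they do not.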

Findings

01

Multimodal-BERT models do not encode scene trees.

02

Dependency parses and scene trees make the inherent structure of language and visuals explicit.

03

Probing shows limited structural knowledge in embeddings.

Abstract

In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image and by the dependencies between the object regions in the image, respectively. We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees. Code available at https://github.com/VSJMilewski/multimodal-probes.
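To make the idea of probing embeddings for tree structure concrete, here is a minimal sketch of a distance-based structural probe, a common technique in the probing literature. The abstract does not specify the probe design, so everything here is an assumption for illustration: the function names, the toy sentence and its head indices, the random stand-in embeddings, and the untrained probe matrix `B` are all hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def tree_distances(heads):
    """Pairwise path lengths in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    n = len(heads)

    def ancestors(i):
        # Path from token i up to the root, starting with i itself.
        path = [i]
        while heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        return path

    paths = [ancestors(i) for i in range(n)]
    dist = np.zeros((n, n), dtype=int)
    for i in range(n):
        depth_i = {node: d for d, node in enumerate(paths[i])}
        for j in range(n):
            # The first node on j's path that is also on i's path is the
            # lowest common ancestor; the two offsets sum to the path length.
            for d_j, node in enumerate(paths[j]):
                if node in depth_i:
                    dist[i, j] = depth_i[node] + d_j
                    break
    return dist

def probe_distances(embeddings, B):
    """Squared L2 distances between embeddings after the linear map B."""
    projected = embeddings @ B.T
    diff = projected[:, None, :] - projected[None, :, :]
    return (diff ** 2).sum(axis=-1)

def probe_loss(embeddings, B, gold):
    """Mean absolute error between predicted and gold tree distances."""
    pred = probe_distances(embeddings, B)
    n = gold.shape[0]
    return np.abs(pred - gold).sum() / (n * n)

# Hypothetical toy sentence: "a dog chases the ball", with "chases" as root.
heads = [1, 2, -1, 4, 2]
gold = tree_distances(heads)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))  # stand-in for model embeddings
B = rng.normal(size=(4, 8))           # untrained probe matrix (assumption)
loss = probe_loss(embeddings, B, gold)
```

In a real experiment, `B` would be trained to minimise this loss over a corpus, and a low loss on held-out data would suggest that the embeddings encode the tree. The same distance computation applies to a scene tree if object regions take the place of tokens.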

Code & Models

Repositories

vsjmilewski/multimodal-probes
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling