Finding Structural Knowledge in Multimodal-BERT
Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens

TL;DR
This paper investigates whether multimodal-BERT models encode grammatical and visual object structures, finding that they do not explicitly store these scene trees despite their multimodal training.
Contribution
The study introduces explicit scene trees for language and visual data and probes multimodal-BERT to assess their encoding of these structures, revealing limitations.
Findings
Multimodal-BERT models do not encode scene trees.
Explicit scene trees clarify the structure of language and visuals.
Probing shows limited structural knowledge in embeddings.
Abstract
In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image and by the dependencies between the object regions in the image, respectively. We call this explicit visual structure the \textit{scene tree}, that is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees.Code available at \url{https://github.com/VSJMilewski/multimodal-probes}.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
