TL;DR
This paper introduces a framework for probing visual-semantic embeddings to understand their linguistic properties, revealing that multimodal embeddings better capture combined image and text information than unimodal ones.
Contribution
It formalizes probing tasks for image-caption embeddings, enabling analysis of their linguistic properties and comparison of different models.
Findings
Visual-semantic embeddings outperform unimodal embeddings by up to 12% on the probing tasks.
The proposed framework reveals complementary information in the text and image modalities.
The probing tasks serve as analysis tools for the otherwise poorly understood inner workings of multimodal embeddings.
Abstract
Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case. To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe for those properties, and (iv) compare various state-of-the-art embeddings under the lens of the proposed probing tasks. Our experiments reveal an up to 12% increase in…
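The probing idea in the abstract (train a simple classifier on frozen embeddings and read its accuracy as a measure of how much of a property the embedding encodes) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the probed property is an invented binary label that depends on both modalities, and plain concatenation stands in for a real visual-semantic embedding. The names `train_probe`, `accuracy`, and `make_pair` are hypothetical.

```python
import math
import random

def train_probe(X, y, epochs=50, lr=0.1):
    """Fit a logistic-regression probe on frozen embeddings using plain SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - label
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def accuracy(model, X, y):
    """Fraction of examples whose probed label is predicted correctly."""
    w, b = model
    hits = 0
    for x, label in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (label == 1))
    return hits / len(y)

def make_pair(rng):
    """Synthetic image-caption pair (assumption: toy data, not a real dataset).

    Each modality carries one informative coordinate plus one noise
    coordinate; the label depends on the sum of both informative parts.
    """
    a = rng.uniform(-1, 1)  # signal visible only in the text embedding
    c = rng.uniform(-1, 1)  # signal visible only in the image embedding
    text_emb = [a, rng.gauss(0, 0.1)]
    image_emb = [c, rng.gauss(0, 0.1)]
    label = 1 if a + c > 0 else 0
    return text_emb, image_emb, label

rng = random.Random(0)
pairs = [make_pair(rng) for _ in range(600)]
train, test = pairs[:400], pairs[400:]

# Three "embeddings" to probe: each modality alone, and their concatenation
# as a stand-in for a multimodal embedding.
views = {
    "text": lambda t, i: t,
    "image": lambda t, i: i,
    "joint": lambda t, i: t + i,
}

results = {}
for name, view in views.items():
    X_train = [view(t, i) for t, i, _ in train]
    y_train = [label for _, _, label in train]
    X_test = [view(t, i) for t, i, _ in test]
    y_test = [label for _, _, label in test]
    model = train_probe(X_train, y_train)
    results[name] = accuracy(model, X_test, y_test)

for name, acc in results.items():
    print(f"{name:>5}: {acc:.2f}")
```

On this toy data the joint view should score higher than either single view, because the label was built to depend on both modalities. The numbers illustrate the mechanics of a probe only and say nothing about the results reported in the paper.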
