Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era
Dan Oneata, Desmond Elliott, Stella Frank

TL;DR
This study evaluates how well large-scale multimodal and unimodal models capture semantic features of concrete objects, revealing their strengths and limitations in representing sensory and encyclopedic attributes.
Contribution
It provides a comparative analysis of image-only, multimodal, and language models in encoding semantic object features using probing tasks and norm datasets.
Findings
Multimodal image encoders slightly outperform language-only models.
Image-only encoders perform comparably to language models, even on non-visual attributes.
Models capture sensory and encyclopedic features with varying degrees of accuracy.
Abstract
Human learning and conceptual representation are grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended, denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are…
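To make the idea of a probing task concrete, here is a minimal sketch of a linear probe: a small classifier trained on frozen concept embeddings to predict whether a concept has a given attribute (say, "is red"). Everything in it is an illustrative assumption rather than the authors' setup: the embeddings are synthetic random vectors, the attribute is a toy binary label, and the names `make_toy_data`, `train_probe`, and `accuracy` are hypothetical. In a real experiment the vectors would come from an image or language encoder and the labels from a norm dataset.

```python
import math
import random


def make_toy_data(n_concepts=200, dim=8, seed=0):
    """Build hypothetical (embedding, label) pairs for one binary attribute."""
    rng = random.Random(seed)
    # Assumption: the attribute is linearly encoded along one hidden direction.
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n_concepts):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        score = sum(d * xi for d, xi in zip(direction, x))
        data.append((x, 1 if score > 0 else 0))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(train, dim, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = [0.0] * dim
    b = 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in train:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(probe, data):
    """Fraction of concepts whose attribute the probe predicts correctly."""
    w, b = probe
    correct = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(data)


data = make_toy_data()
train, test = data[:150], data[150:]  # held-out concepts test generalisation
probe = train_probe(train, dim=8)
test_accuracy = accuracy(probe, test)
print(f"held-out probe accuracy: {test_accuracy:.2f}")
```

The encoder is never updated; only the probe's weights are learned. High held-out accuracy therefore suggests the attribute is linearly recoverable from the representation, while low accuracy suggests it is not readily accessible to a probe of this form.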
Taxonomy
Topics: Multimodal Machine Learning Applications · Embodied and Extended Cognition · Face Recognition and Perception
