Individuation in Neural Models with and without Visual Grounding
Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov

TL;DR
This paper compares how a vision-and-language model (CLIP) and two text-only models (FastText and SBERT) encode individuation information, showing that CLIP captures quantitative differences in individuation better and yields an individuation hierarchy that agrees with those proposed in linguistics and cognitive science.
Contribution
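It demonstrates that CLIP embeddings capture quantitative differences in individuation better than the text-only models FastText and SBERT, and that the individuation hierarchy derived from CLIP embeddings agrees with hierarchies proposed in linguistics and cognitive science.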
Findings
CLIP captures quantitative differences in individuation better than text-only models
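The individuation hierarchy derived from CLIP embeddings agrees with hierarchies proposed in linguistics and cognitive science
The vision-and-language model CLIP encodes richer individuation information than the text-only models FastText and SBERT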
Abstract
We show differences between the language-and-vision model CLIP and two text-only models, FastText and SBERT, when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
Taxonomy
Topics: Neural dynamics and brain function · Visual perception and processing mechanisms
Methods: fastText · Sentence-BERT · Contrastive Language-Image Pre-training
