Individuation in Neural Models with and without Visual Grounding

Alexey Tikhonov; Lisa Bylinina; Ivan P. Yamshchikov

arXiv:2409.18868·cs.CL·September 30, 2024

Open Access

TL;DR

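This paper compares how a vision-and-language model (CLIP) and two text-only models (FastText and SBERT) encode individuation information. It finds that CLIP embeddings capture quantitative differences in individuation better than the text-only models, and that the individuation hierarchy derived from CLIP agrees with hierarchies proposed in linguistics and cognitive science.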

Contribution

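The paper demonstrates that CLIP embeddings capture quantitative differences in individuation better than the text-only models FastText and SBERT, and that the individuation hierarchy deduced from CLIP agrees with hierarchies proposed in linguistics and cognitive science.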

Findings

01

CLIP captures quantitative differences in individuation better than text-only models

02

CLIP embeddings agree with hierarchies in linguistics and cognitive science

03

The vision-and-language model CLIP encodes richer individuation information than the text-only models FastText and SBERT

Abstract

We show differences between a language-and-vision model CLIP and two text-only models - FastText and SBERT - when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
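As a rough illustration of what comparing latent representations across individuation categories can look like, the sketch below averages embeddings per category and compares the category centroids by cosine similarity. This is not the authors' procedure: the function names, the centroid-based comparison, and the three-dimensional toy vectors are all assumptions made for illustration, and in practice the vectors would come from CLIP, FastText, or SBERT.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def centroid_similarities(embeddings_by_category):
    """Average the embeddings in each category, then return the cosine
    similarity for every unordered pair of category centroids.

    Hypothetical helper: the abstract does not say how categories are compared.
    """
    names = list(embeddings_by_category)
    centroids = {
        name: np.mean(np.asarray(embeddings_by_category[name], dtype=float), axis=0)
        for name in names
    }
    return {
        (p, q): cosine_similarity(centroids[p], centroids[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Toy vectors standing in for real model embeddings (made up, not from the paper).
toy_embeddings = {
    "substrate": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "granular_aggregate": [[0.7, 0.7, 0.0], [0.6, 0.8, 0.0]],
    "single_object": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}

similarities = centroid_similarities(toy_embeddings)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

With these toy vectors the granular aggregate centroid sits between the substrate and single-object centroids, which is the kind of ordering an individuation hierarchy would predict. Whether real embeddings show such an ordering is the empirical question the paper addresses.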

Taxonomy

Topics: Neural dynamics and brain function · Visual perception and processing mechanisms

Methods: fastText · Sentence-BERT · Contrastive Language-Image Pre-training