What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models

Tessa Verhoef; Kiana Shahrasbi; Tom Kouwenhoven

arXiv:2407.17974 · cs.CL · July 26, 2024

TL;DR

This paper investigates whether vision-and-language models encode human-like cross-modal associations, specifically the bouba-kiki effect. It finds no conclusive evidence for the effect and suggests that results may depend on model features such as architecture design, model size, and training details.

Contribution

The paper probes four vision-and-language models (VLMs) for the bouba-kiki effect, comparing how closely each aligns with human cross-modal preferences and which factors may influence that alignment.

Findings

01 No conclusive evidence of the bouba-kiki effect in the tested models

02 Model features such as architecture design, model size, and training details may influence cross-modal associations

03 Results inform future development of models aligned with human cognition

Abstract

Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision-and-language models (VLMs), it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition…
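
To make the idea of probing for the bouba-kiki effect concrete, here is a minimal sketch of one way a congruence score could be computed. It is not the authors' procedure. The function `congruent_match_rate`, the shape labels `"round"` and `"jagged"`, the example pseudowords, and the `toy_score` scorer are all hypothetical names chosen for illustration. In an actual probe, the scorer would presumably be a VLM's similarity between a pseudoword and a shape image rather than the letter-counting placeholder used here.

```python
from typing import Callable, Sequence


def congruent_match_rate(
    round_words: Sequence[str],
    sharp_words: Sequence[str],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pseudowords paired with the shape humans typically prefer.

    `score(word, shape)` is an assumed interface: higher means a stronger
    word-shape association. Ties are counted as misses.
    """
    hits = 0
    for word in round_words:
        # Humans tend to match "bouba"-like words to round shapes.
        hits += score(word, "round") > score(word, "jagged")
    for word in sharp_words:
        # Humans tend to match "kiki"-like words to jagged shapes.
        hits += score(word, "jagged") > score(word, "round")
    return hits / (len(round_words) + len(sharp_words))


def toy_score(word: str, shape: str) -> float:
    """Placeholder scorer, NOT a real model: counts stereotyped letters."""
    soft, hard = set("bmlou"), set("ktpie")
    letters = soft if shape == "round" else hard
    return float(sum(ch in letters for ch in word.lower()))


rate = congruent_match_rate(["bouba", "maluma"], ["kiki", "takete"], toy_score)
print(rate)
```

Under this scheme, a rate near 0.5 would be consistent with no preference, while a rate well above 0.5 would point toward human-like associations. The toy scorer is built to agree with the human pattern, so it only demonstrates the bookkeeping, not any result about a model.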

Taxonomy

Topics: Language, Metaphor, and Cognition · Hearing Impairment and Communication · Multisensory perception and integration

Methods: ALIGN