Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect
Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef

TL;DR
This study evaluates whether vision-and-language models such as CLIP exhibit human-like cross-modal associations, specifically the bouba-kiki effect. It finds that they do not consistently show these associations, highlighting a gap between model behaviour and human cross-modal integration.
Contribution
The paper provides a comprehensive re-evaluation of the bouba-kiki effect in two CLIP variants (ResNet and ViT), combining a prompt-based probability evaluation with Grad-CAM as a novel interpretive method, and reveals limitations in their cross-modal understanding.
Findings
Models do not consistently exhibit the bouba-kiki effect.
ResNet shows a preference for round shapes but lacks robust associations.
Model responses fall short of human cross-modal integration.
Abstract
Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like 'bouba' with round shapes and 'kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and Grad-CAM as a novel approach to interpret visual attention in shape-word matching tasks. Our findings show that these model variants do…
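The prompt-based evaluation mentioned in the abstract uses probabilities as a measure of model preference. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes a model has already produced one similarity score per candidate pseudoword for a given image, and it turns those scores into probabilities with a softmax. The function names `softmax` and `preference` and the numbers in `toy_scores` are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preference(scores_by_label):
    """Map each candidate label to its softmax probability."""
    labels = list(scores_by_label)
    probs = softmax([scores_by_label[label] for label in labels])
    return dict(zip(labels, probs))


# Hypothetical image-text similarity scores for a single round shape.
# These values are made up for illustration and do not come from the paper.
toy_scores = {"bouba": 1.2, "kiki": 0.4}

probs = preference(toy_scores)
preferred = max(probs, key=probs.get)  # label with the highest probability
print(probs, preferred)
```

In this sketch, a bouba-kiki-like preference would show up as the probability for 'bouba' exceeding that for 'kiki' on round shapes, and the reverse on jagged ones, across many images rather than a single toy case.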
Taxonomy
Topics: Language, Metaphor, and Cognition · EFL/ESL Teaching and Learning · Language, Discourse, Communication Strategies
