Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect

Tom Kouwenhoven; Kiana Shahrasbi; Tessa Verhoef

arXiv:2507.10013·cs.CV·October 16, 2025

Open Access

TL;DR

This study tests whether vision-and-language models such as CLIP exhibit the bouba-kiki effect, a cross-modal association that humans show reliably. It finds that the models do not consistently demonstrate the effect, which points to limits on how closely they mirror human cognition.

Contribution

The paper provides a comprehensive re-evaluation of the bouba-kiki effect in two CLIP variants (ResNet and ViT), combining a prompt-based probability evaluation with Grad-CAM as a novel way to interpret visual attention, and reveals limitations in their cross-modal understanding.

Findings

01

Models do not consistently exhibit the bouba-kiki effect.

02

ResNet shows a preference for round shapes but lacks robust associations.

03

Model responses fall short of human cross-modal integration.

Abstract

Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like 'bouba' with round shapes and 'kiki' with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and Grad-CAM as a novel approach to interpreting visual attention in shape-word matching tasks. Our findings show that these model variants do…
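
To make the idea of "probabilities as a measure of model preference" concrete, here is a minimal sketch, not the authors' code. It assumes that image-text similarity scores for one shape image and two pseudoword prompts have already been obtained from a CLIP-style model, and it turns them into a probability per pseudoword with a softmax. The function names (`softmax`, `label_preference`) and the example scores are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw similarity scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def label_preference(similarities):
    """Map {pseudoword: similarity score} to {pseudoword: probability}.

    `similarities` is assumed to hold one image's similarity to each
    candidate pseudoword prompt (hypothetical input format).
    """
    words = list(similarities)
    probs = softmax([similarities[w] for w in words])
    return dict(zip(words, probs))


# Made-up scores for a single round shape; NOT results from the paper.
round_shape_scores = {"bouba": 2.0, "kiki": 1.0}
prefs = label_preference(round_shape_scores)
preferred = max(prefs, key=prefs.get)
```

Under these assumptions, a human-like bouba-kiki effect would show up as round shapes receiving a higher probability for 'bouba' and jagged shapes a higher probability for 'kiki', aggregated over many shapes and pseudowords.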

Taxonomy

Topics: Language, Metaphor, and Cognition · EFL/ESL Teaching and Learning · Language, Discourse, Communication Strategies