Language-Conditioned Visual Grounding with CLIP Multilingual
J. de Curtò, Mauro Liz, I. de Zarzà

TL;DR
This study investigates performance gaps in multilingual vision-language models by holding the visual encoder fixed and varying only the text branch. It finds that the text branch accounts for most of the gap, especially in low-resource languages, and that spatial misalignment, rather than signal collapse, is the main obstacle to effective grounding.
Contribution
The paper introduces a dense multilingual CLIP probe to isolate the source of performance gaps, demonstrating the impact of language resource levels and model scaling on visual grounding accuracy.
Findings
Low-resource languages show significant deficits in the text branch.
Scaling the visual encoder widens the gap for some languages but improves results for others.
Spatial misalignment, not signal collapse, is the main failure mode.
Abstract
Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a…
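The abstract names two of its agreement metrics, top-percentile IoU and Spearman rank correlation against an English reference, without giving their exact definitions. The sketch below shows one plausible way to compute them for a pair of dense relevance maps. The function names, the top-10% cutoff, and the toy maps are illustrative assumptions, not the authors' implementation. Cluster-mask IoU is left out because the clustering procedure is not described here.

```python
import numpy as np


def top_percentile_mask(heatmap, top_pct=10.0):
    """Binary mask of the highest-scoring `top_pct` percent of cells.

    The 10% default is an assumption, not a value taken from the paper.
    """
    threshold = np.percentile(heatmap, 100.0 - top_pct)
    return heatmap >= threshold


def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of equal shape."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as identical
    return float(np.logical_and(mask_a, mask_b).sum() / union)


def spearman_rho(map_a, map_b):
    """Spearman correlation: Pearson correlation of the flattened ranks.

    Ties are broken by position rather than averaged, which is adequate
    for continuous-valued maps where exact ties are rare.
    """
    def ranks(x):
        order = np.argsort(x.ravel(), kind="stable")
        r = np.empty(order.size, dtype=float)
        r[order] = np.arange(order.size)
        return r

    return float(np.corrcoef(ranks(map_a), ranks(map_b))[0, 1])


# Toy stand-ins for one (concept, image) pair: an English reference map
# and a perturbed map playing the role of another language.
rng = np.random.default_rng(0)
english_map = rng.random((7, 7))
other_map = english_map + 0.3 * rng.standard_normal((7, 7))

iou = mask_iou(top_percentile_mask(english_map), top_percentile_mask(other_map))
rho = spearman_rho(english_map, other_map)
print(f"top-percentile IoU = {iou:.3f}, Spearman rho = {rho:.3f}")
```

Under this reading, each language contributes one such pair of scores for every concept and image, which matches the 11 x 210 = 2,310 paired observations per language stated in the abstract.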
