Semantic and Expressive Variation in Image Captions Across Languages
Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna

TL;DR
This paper investigates how cultural and linguistic differences influence image captioning across seven languages, revealing significant variations in semantic content and expression that impact dataset diversity and model performance.
Contribution
It demonstrates that multilingual image descriptions contain richer semantic information and that models inherit cultural biases, emphasizing the importance of multilingual data for more inclusive vision-language systems.
Findings
On average, multilingual caption sets mention 29.9% more objects and 46.0% more attributes than monolingual ones.
Models trained on multilingual data perform well across languages.
Cultural biases affect image description and model outputs.
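A figure such as "29.9% more objects" is a relative increase. The sketch below shows one way such a percentage could be computed from the unique objects mentioned in two caption sets for the same image. It is an illustration only: the function name `relative_increase` and the toy object lists are assumptions made here, not the paper's pipeline or data.

```python
def relative_increase(mono_items, multi_items):
    """Percent by which the multilingual set of unique items exceeds the monolingual one."""
    mono, multi = set(mono_items), set(multi_items)
    if not mono:
        raise ValueError("monolingual set must be non-empty")
    return 100.0 * (len(multi) - len(mono)) / len(mono)


# Toy, made-up object mentions for a single image (not data from the paper).
english_only = ["dog", "ball", "grass", "dog"]
# Hypothetical extra objects mentioned only by captions in other languages.
pooled_languages = english_only + ["fence", "shadow"]

# 3 unique objects vs. 5 unique objects
print(round(relative_increase(english_only, pooled_languages), 1))
```

Duplicates are collapsed with `set`, so repeated mentions of the same object do not inflate the count. A real analysis would also need to extract objects from captions and align them across languages, which this sketch leaves out.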
Abstract
Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli. In this paper, we study how these differences manifest themselves in vision-language datasets and models, using language as a proxy for culture. By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression. When datasets are multilingual as opposed to monolingual, descriptions have higher semantic coverage on average, where coverage is…
Taxonomy
Topics: Categorization, perception, and language
Methods: Sparse Evolutionary Training
