GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher

TL;DR
This paper introduces GIMMICK, a multimodal benchmark for evaluating the cultural knowledge of large vision-language models (LVLMs) across 144 countries. The evaluation reveals strong Western cultural biases and a correlation between model size and performance.
Contribution
GIMMICK is the first extensive multimodal benchmark to assess cultural knowledge across 144 countries and 728 cultural events or facets, enabling systematic evaluation of LVLMs.
Findings
Models show strong Western cultural biases.
Model size correlates with performance.
Multimodal input improves cultural understanding.
Abstract
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We…
Taxonomy
Topics: Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
