Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

TL;DR
This paper introduces two benchmarks, KnowRecall and VisRecall, to evaluate cross-lingual consistency in multimodal large language models (MLLMs). It finds that current models struggle to recall cultural and factual knowledge consistently across languages.
Contribution
The paper presents novel benchmarks for assessing cross-lingual consistency in MLLMs, highlighting the challenges and gaps in current models' multilingual and cultural knowledge capabilities.
Findings
State-of-the-art MLLMs, including proprietary ones, show limited cross-lingual consistency.
Models struggle to keep factual knowledge and visual memory consistent across languages.
The results point to a need for more robust multilingual and culturally aware models.
Abstract
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more…
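To make the idea of cross-lingual consistency concrete, here is a minimal sketch of one way such a score could be computed for multiple-choice answers. It is an illustrative assumption, not the metric defined in the paper: the function name pairwise_consistency, the input layout, and the pairwise-agreement definition are all hypothetical.

```python
from itertools import combinations


def pairwise_consistency(answers_by_lang):
    """Fraction of (question, language-pair) combinations with matching answers.

    answers_by_lang maps a language code to a list of answer choices,
    with every list following the same question order.
    NOTE: hypothetical definition for illustration, not the paper's metric.
    """
    langs = sorted(answers_by_lang)
    if len(langs) < 2:
        raise ValueError("need answers in at least two languages")

    n_questions = len(answers_by_lang[langs[0]])
    if any(len(answers_by_lang[lang]) != n_questions for lang in langs):
        raise ValueError("every language must answer the same questions")

    agree = 0
    total = 0
    # Compare every unordered pair of languages, question by question.
    for lang_a, lang_b in combinations(langs, 2):
        for ans_a, ans_b in zip(answers_by_lang[lang_a], answers_by_lang[lang_b]):
            agree += int(ans_a == ans_b)
            total += 1
    return agree / total if total else 0.0


# Toy answers to four questions in three languages (made-up data).
answers = {
    "en": ["A", "B", "C", "D"],
    "ja": ["A", "B", "D", "D"],
    "fr": ["A", "C", "C", "D"],
}
score = pairwise_consistency(answers)
```

A score of 1.0 under this definition would mean the model picks the same option for every question regardless of the prompt language. The score says nothing about whether those answers are correct, so accuracy would need to be reported alongside it.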
Taxonomy
Topics: Natural Language Processing Techniques · Linguistics and Terminology Studies · Translation Studies and Practices
Methods: Attentive Walk-Aggregating Graph Neural Network
