Vision Language Models are Confused Tourists

Patrick Amadeus Irawan; Ikhlasul Akmal Hanif; Muhammad Dehan Al Kautsar; Genta Indra Winata; Fajri Koto; Alham Fikri Aji

arXiv:2511.17004·cs.CV·December 24, 2025

TL;DR

This paper introduces ConfusedTourist, a benchmark to evaluate Vision-Language Models' robustness to cultural perturbations, revealing significant vulnerabilities and attention shifts that impair model stability across diverse cultural inputs.

Contribution

The paper presents a novel adversarial robustness suite for cultural evaluation of VLMs, exposing their weaknesses in handling mixed cultural cues and highlighting the need for improved cultural robustness.

Findings

01 VLM accuracy drops under simple cultural perturbations

02 Image-generation-based perturbations worsen model performance

03 Attention shifts cause models to focus on distracting cues

Abstract

Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial for supporting diverse, multicultural societies. Existing evaluations often rely on benchmarks featuring only a single cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability: accuracy drops heavily under simple image-stacking perturbations and degrades further under the image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts…
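
The robustness claim in the abstract can be made concrete with a small sketch. The abstract does not state how the accuracy drop is computed, so the following assumes the simplest reading: accuracy on clean images minus accuracy on the perturbed versions of the same images. The function names, labels, and predictions are hypothetical toy values for illustration, not data or code from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


def accuracy_drop(clean_preds, perturbed_preds, labels):
    """Clean accuracy minus perturbed accuracy (assumed robustness measure)."""
    return accuracy(clean_preds, labels) - accuracy(perturbed_preds, labels)


# Toy example (hypothetical): gold countries for four images, plus a model's
# answers on the clean images and on perturbed copies of the same images.
labels = ["Japan", "Mexico", "Indonesia", "Egypt"]
clean_preds = ["Japan", "Mexico", "Indonesia", "Egypt"]
perturbed_preds = ["Japan", "France", "India", "Egypt"]

drop = accuracy_drop(clean_preds, perturbed_preds, labels)
print(f"clean={accuracy(clean_preds, labels):.2f} "
      f"perturbed={accuracy(perturbed_preds, labels):.2f} drop={drop:.2f}")
```

A larger drop would indicate lower stability under the perturbation. Comparing the drop for two perturbation types on the same label set is one way to check a claim such as "the generation-based variant is worse than stacking".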

Code & Models

Datasets

patrickamadeus/vlms-are-confused-tourists
dataset · 43 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Ethics and Social Impacts of AI