TL;DR
This paper introduces M$^2$CQA, a multilingual benchmark for evaluating counterfactual hallucination in vision-language models across diverse cultural contexts, revealing significant biases and failure modes.
Contribution
It presents a new culturally grounded benchmark and a metric for measuring counterfactual hallucination, highlighting challenges in multilingual and dialectal settings.
Findings
Counterfactual hallucination rates rise sharply in Arabic, especially in dialects, even when true-statement accuracy remains high.
Reasoning-first prompting increases hallucination.
Answering before justification improves robustness.
Abstract
Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M$^2$CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently…
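The abstract describes CFHR only in words, so the following is a minimal sketch of one plausible reading: among image-statement pairs where the model answers the true statement correctly, the fraction where it also accepts the counterfactual statement. The `PairResult` type, its field names, and the `cfhr` function are hypothetical and do not come from the paper; the exact formula the authors use may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairResult:
    """Outcome for one contrastive pair (hypothetical structure, not from the paper)."""
    true_correct: bool   # model answered the true statement correctly
    cf_accepted: bool    # model accepted the counterfactual statement


def cfhr(results: Iterable[PairResult]) -> Optional[float]:
    """Assumed CFHR: share of counterfactual acceptances among pairs
    where the true statement was answered correctly.

    Returns None when no pair satisfies the condition, since the
    conditional rate is undefined in that case.
    """
    conditioned = [r for r in results if r.true_correct]
    if not conditioned:
        return None
    accepted = sum(1 for r in conditioned if r.cf_accepted)
    return accepted / len(conditioned)


if __name__ == "__main__":
    sample = [
        PairResult(true_correct=True, cf_accepted=True),
        PairResult(true_correct=True, cf_accepted=False),
        PairResult(true_correct=False, cf_accepted=True),   # excluded by the condition
        PairResult(true_correct=True, cf_accepted=False),
    ]
    print(cfhr(sample))
```

Conditioning on a correct true-statement answer is what separates this rate from raw accuracy: a model can score well on true statements and still have a high CFHR if it also agrees with the contrasting counterfactual.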
