MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Sagarika Banerjee; Tangatar Madi; Advait Swaminathan; Nguyen Dao Minh Anh; Shivank Garg; Kevin Zhu; Vasu Sharma

arXiv:2602.18729 · cs.CV · February 24, 2026

TL;DR

MiSCHiEF introduces two benchmark datasets to evaluate vision-language models on their ability to distinguish subtle safety and cultural differences in images and captions, revealing persistent challenges in fine-grained alignment.

Contribution

The paper presents MiSCHiEF, novel datasets for assessing fine-grained image-caption alignment in safety and cultural contexts, highlighting current model limitations in subtle cross-modal distinctions.

Findings

01 Models perform better at confirming correct image-caption pairs than at rejecting incorrect ones.

02 Models achieve higher accuracy when selecting the correct caption from similar options than in the converse task of selecting the correct image.

03 Modality misalignment remains a persistent challenge for current vision-language models.

Abstract

Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models…
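The contrastive pair design described in the abstract can be sketched in code. The following is a minimal illustration only, not the authors' evaluation harness: the `MinimalPair` fields, the `evaluate` function, the `score_fn` interface (a stand-in for a VLM that returns a match score for an image and a caption), and the 0.5 threshold are all hypothetical assumptions, and the paper's actual prompts, task formats, and metrics may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MinimalPair:
    """One sample: two minimally differing images with their captions.

    Field names are hypothetical; in MiS the two sides would be the safe and
    unsafe scenario, in MiC the two cultural contexts.
    """
    image_a: str
    image_b: str
    caption_a: str
    caption_b: str


# Hypothetical model interface: higher score means a better image-caption match.
ScoreFn = Callable[[str, str], float]


def evaluate(pairs: List[MinimalPair], score_fn: ScoreFn,
             threshold: float = 0.5) -> Dict[str, float]:
    """Compute four accuracies over the pairs (assumed metrics, see lead-in)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    confirm = reject = caption_sel = image_sel = 0
    for p in pairs:
        s_aa = score_fn(p.image_a, p.caption_a)  # matched
        s_bb = score_fn(p.image_b, p.caption_b)  # matched
        s_ab = score_fn(p.image_a, p.caption_b)  # mismatched
        s_ba = score_fn(p.image_b, p.caption_a)  # mismatched
        # Verification: accept matched pairs, reject mismatched pairs.
        confirm += int(s_aa >= threshold) + int(s_bb >= threshold)
        reject += int(s_ab < threshold) + int(s_ba < threshold)
        # Caption selection: given an image, prefer its own caption.
        caption_sel += int(s_aa > s_ab) + int(s_bb > s_ba)
        # Image selection: given a caption, prefer its own image.
        image_sel += int(s_aa > s_ba) + int(s_bb > s_ab)
    n = 2 * len(pairs)  # each pair contributes two trials per task
    return {
        "confirm_acc": confirm / n,
        "reject_acc": reject / n,
        "caption_selection_acc": caption_sel / n,
        "image_selection_acc": image_sel / n,
    }


# Toy usage with made-up scores (not results from the paper).
toy_scores = {
    ("imgA", "capA"): 0.9, ("imgA", "capB"): 0.8,
    ("imgB", "capB"): 0.7, ("imgB", "capA"): 0.2,
}
toy_pairs = [MinimalPair("imgA", "imgB", "capA", "capB")]
metrics = evaluate(toy_pairs, lambda img, cap: toy_scores[(img, cap)])
```

Reporting the four accuracies separately makes asymmetries such as those listed under Findings visible: a model can confirm every matched pair while still accepting a mismatched one, which is what the toy scores above are constructed to show.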

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques