Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

Dosung Lee, Sangwon Jung, Boyoung Kim, Minyoung Kim, Sungyeon Kim, Junyoung Sung, and Paul Hongsuck Seo

arXiv:2511.22843 · cs.CV · February 27, 2026

Open Access

TL;DR

This paper shows that current multimodal knowledge-based VQA models rely on visual shortcuts, introduces RETINA, a benchmark that removes those shortcuts, and proposes MIMIR, a retriever that improves robustness by leveraging multiple related images.

Contribution

The paper introduces RETINA, a new benchmark that removes visual shortcuts, and MIMIR, a method that uses multiple images to enhance model understanding in VQA.

Findings

01 Models rely heavily on visual shortcuts in existing benchmarks.

02 RETINA significantly degrades existing model performance, revealing shortcut reliance.

03 MIMIR improves robustness by incorporating multiple related images.

Abstract

Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches…
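
The shortcut described in the abstract can be illustrated with a toy example. The sketch below is not the authors' pipeline or the RETINA data format; the `Sample` class, the `image_only_retrieve` function, the corpus, and the Eiffel Tower entities are all hypothetical and chosen only for illustration. It assumes each query has one target document identified by its primary subject entity, and it models a retriever that looks only at the entity in the query image.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical query record, not the RETINA schema."""
    question: str
    image_entity: str    # entity shown in the query image
    target_entity: str   # primary subject entity of the target document


def has_visual_shortcut(sample: Sample) -> bool:
    """True when the query image depicts the target document's primary subject."""
    return sample.image_entity == sample.target_entity


def image_only_retrieve(sample: Sample, corpus: dict[str, str]) -> Optional[str]:
    """Toy retriever that ignores the question and matches on the image entity alone."""
    return corpus.get(sample.image_entity)


# Hypothetical corpus: primary subject entity -> document id.
corpus = {
    "Eiffel Tower": "doc_eiffel_tower",
    "Gustave Eiffel": "doc_gustave_eiffel",
}

# Conventional setting: the image shows the target document's primary subject.
shortcut_sample = Sample(
    question="When was this structure completed?",
    image_entity="Eiffel Tower",
    target_entity="Eiffel Tower",
)

# Related-entity setting: the image shows a secondary subject of the target document.
related_sample = Sample(
    question="When was the tower built by this person's company completed?",
    image_entity="Gustave Eiffel",
    target_entity="Eiffel Tower",
)

for s in (shortcut_sample, related_sample):
    retrieved = image_only_retrieve(s, corpus)
    hit = retrieved == corpus[s.target_entity]
    print(f"shortcut={has_visual_shortcut(s)} image_only_hit={hit}")
```

In this toy setup the image-only lookup finds the target document for the first sample and returns the wrong document for the second, which mirrors why pairing queries with images of related entities removes the shortcut.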

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks