TL;DR
This paper introduces a method to analyze multi-modal video question answering models using Shapley values, revealing the models' reliance on text and their tendency to ignore distractors, with implications for model evaluation.
Contribution
It proposes a novel joint attribution method for features and modalities in VQA models and applies it to compare different models across datasets.
Findings
Models show dependence on textual information.
VQA tasks often reduce to ignoring distractors.
Analysis highlights the importance of modality interactions.
Abstract
As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the difficulty of video question answering (VQA) datasets and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. How do we measure whether we're heading in the right direction, given the complexity that multi-modal models introduce? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare VLMs of varying context lengths on representative datasets, focusing on…
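To make the idea of Shapley-based feature attributions and modality scores concrete, here is a minimal sketch, not the paper's implementation. It computes exact Shapley values over a handful of input features and then aggregates them per modality. The feature names, the `toy_value` function (standing in for a model's score on the correct answer when only a subset of features is unmasked), and the normalization by summed absolute attributions are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value_fn):
    """Exact Shapley values; feasible only for a small number of players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of coalitions of size k that exclude player p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[p] += weight * (value_fn(s | {p}) - value_fn(s))
    return phi


# Hypothetical feature-to-modality assignment (any grouping could be used).
MODALITY = {"frame_0": "video", "frame_1": "video", "question": "text"}


def toy_value(subset):
    """Hypothetical stand-in for the model's score with only `subset` unmasked."""
    score = 0.25  # chance level for a four-option question
    if "question" in subset:
        score += 0.30
    if "frame_0" in subset:
        score += 0.05
    if "frame_1" in subset:
        score += 0.05
    if "question" in subset and "frame_1" in subset:
        score += 0.20  # interaction between text and video
    return score


def modality_scores(phi, modality):
    """Assumed aggregation: share of total absolute attribution per modality."""
    total = sum(abs(v) for v in phi.values())
    scores = {}
    for feature, value in phi.items():
        name = modality[feature]
        scores[name] = scores.get(name, 0.0) + abs(value) / total
    return scores


phi = shapley_values(list(MODALITY), toy_value)
scores = modality_scores(phi, MODALITY)
print(phi)
print(scores)
```

In this toy setup the attributions sum to the gap between the fully unmasked and fully masked scores, and the interaction term is split evenly between the two features that share it. Exact enumeration grows exponentially with the number of features, so real inputs would need a sampling-based approximation.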
Peer Reviews
Decision: Submitted to ICLR 2026
Pros: 1. The paper is very well motivated from an evaluation perspective. Finding which modality plays a part in the final output is very important. I can see many downstream tasks requiring explainability. 2. In a way, the authors are under-selling the paper as a diagnosis for methods only. I think, especially from L299, this can be a good diagnostic for benchmark creation as well. Usual way for benchmark creation to ensure requirement of video modality is to show video-blind models perform similar
Cons: 1. My main concern is that the only task evaluated is multiple-choice VQA. This severely restricts its applications (which the authors also note in the limitations, Section 5). Authors should at least experiment with full-string match? I am slightly confused why a trivial extension of the proposed method cannot be done by, say, removing parts of the output tokens. It would be great to have the authors expand on this. 2. The main takeaway is that video modality is under-represented. It is a good to ha
The extension to video generates new results that differ from the image+text results in somewhat interesting ways. Specifically: (1) unlike images, video is always less important than text, though only a small number of questions can be correctly answered with the video completely masked, (2) the importance of the video increases if the number of candidate answers in a multiple-choice question is increased by rotating in some answers randomly chosen from other questions.
Although the analyses of video are interesting, it is difficult to recommend acceptance because all of the proposed algorithms have previously been applied, in more or less the same form, to image-text VQA.
1. The introduction of a Shapley-value–based attribution framework to investigate modality bias in VLMs is novel. This approach provides a principled and interpretable means of quantifying the relative contributions of visual and textual inputs, which has not been systematically explored in prior multimodal reasoning research. 2. The paper presents an extensive and well-structured empirical study, evaluating six VLMs with varying context lengths across four diverse VQA datasets that differ in p
1. Despite the thorough experimental analysis, the paper lacks a clear methodological or algorithmic contribution that meets the technical novelty threshold typically expected at ICLR. The proposed attribution framework, while well-motivated, primarily extends existing interpretability techniques rather than introducing a fundamentally new learning paradigm or model architecture to address their findings. 2. Many of the reported empirical findings reiterate observations that have been discussed
