Benchmark Visual Question Answer Models by using Focus Map
Wenda Qiu, Yueyang Xianzang, Zhekai Zhang

TL;DR
This paper introduces a method to evaluate focus maps in visual reasoning models, demonstrating that certain models learn to focus on relevant objects more effectively than end-to-end models.
Contribution
It proposes a novel evaluation approach for focus maps in visual reasoning models and applies it to compare different models on the CLEVR dataset.
Findings
The CLEVR-iep model learns to focus on relevant objects more than end-to-end models do
The evaluation method can be applied to any model with inferable focus maps
Focus maps correlate with model performance on visual reasoning tasks
Abstract
"Inferring and Executing Programs for Visual Reasoning" proposes a model for visual reasoning that consists of a program generator and an execution engine, avoiding an end-to-end design. To show that the model actually learns which objects to focus on to answer the questions, the authors visualize the norm of the gradient of the sum of the predicted answer scores with respect to the final feature map. However, the authors do not evaluate the effectiveness of this focus map. This paper proposes a method for evaluating it. We generate several kinds of questions to test different keywords. We infer focus maps from the model by asking these questions and evaluate them by comparing them with the segmentation graph. Furthermore, this method can be applied to any model from which focus maps can be inferred. By evaluating the focus maps of different models on the CLEVR dataset, we will show that…
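The abstract does not say how a focus map is scored against the segmentation, so the following is only a minimal sketch under stated assumptions. It collapses a gradient of shape (channels, height, width) into a focus map by taking the norm over channels, as the abstract describes, and then scores it with a hypothetical metric: the fraction of total focus that falls inside a binary mask of the relevant objects. The function names, the toy gradient, and the assumption that the mask has the same resolution as the feature map are illustrative and do not come from the paper.

```python
import numpy as np


def focus_map_from_gradient(grad):
    """Collapse a (C, H, W) gradient into an (H, W) focus map.

    Uses the L2 norm over the channel axis, mirroring the
    "norm of the gradient" visualization described in the abstract.
    """
    return np.linalg.norm(grad, axis=0)


def focus_inside_mask(focus_map, mask):
    """Fraction of total focus that lies where mask is True.

    Hypothetical metric (not taken from the paper): 1.0 means all
    focus is on the relevant objects, 0.0 means none of it is.
    """
    total = focus_map.sum()
    if total == 0:
        return 0.0
    return float(focus_map[mask].sum() / total)


# Toy gradient standing in for d(sum of answer scores)/d(feature map).
grad = np.zeros((2, 4, 4))
grad[0, 1, 1] = 3.0
grad[1, 1, 1] = 4.0  # channel norm at (1, 1) is 5
grad[0, 3, 3] = 5.0  # channel norm at (3, 3) is 5

# Assumed segmentation mask: the relevant object covers the top-left 2x2 block.
mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 0:2] = True

focus = focus_map_from_gradient(grad)
score = focus_inside_mask(focus, mask)
print(score)  # 0.5: half of the focus lies on the relevant object
```

In practice the final feature map is usually coarser than the input image, so the segmentation would first have to be resized to the feature-map resolution (or the focus map upsampled) before a comparison like this makes sense.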
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
