Measuring CLEVRness: Blackbox testing of Visual Reasoning Models
Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

TL;DR
This paper introduces a behavioral testing framework for visual reasoning models, framed as a two-player game: an adversary reconfigures scenes to test whether models truly reason or merely exploit dataset biases. The results reveal limitations in current models.
Contribution
It proposes a novel black-box testing method that uses adversarial scene reconfiguration to evaluate the reasoning abilities of visual QA models.
Findings
CLEVR models can be easily fooled by adversarial reconfigurations
Current models may rely on dataset biases rather than true reasoning
The method provides a controlled way to measure reasoning efficiency
Abstract
How can we measure the reasoning capabilities of intelligence systems? Visual question answering provides a convenient framework for testing the model's abilities by interrogating the model through questions about the scene. However, despite scores of various visual QA datasets and architectures, which sometimes yield even a super-human performance, the question of whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models of CLEVR. These models are trained on a diagnostic dataset benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our…
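The two-player setup described in the abstract can be illustrated with a small toy sketch. This is not the authors' method: the paper trains an adversarial player, whereas the sketch below swaps in plain random search, and every name in it (Obj, biased_model, reconfigure, attack) is hypothetical. The "model" is a stand-in with a deliberately planted positional bias, and the adversary is assumed to only move objects, so the true answer to a counting question cannot change. The only access to the model is through its answers, which is what makes the test black-box.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    shape: str   # e.g. "cube" or "sphere"
    color: str   # e.g. "red" or "blue"
    x: float     # position, assumed to lie in [-3, 3]
    y: float


def ground_truth(scene):
    """True answer to the fixed question 'How many red cubes are there?'."""
    return sum(o.shape == "cube" and o.color == "red" for o in scene)


def biased_model(scene):
    """Hypothetical black-box model with a planted bias: it misses objects with x > 2."""
    return sum(
        o.shape == "cube" and o.color == "red" and o.x <= 2.0 for o in scene
    )


def reconfigure(scene, rng):
    """Move one random object. Shapes and colors are untouched, so the count is preserved."""
    i = rng.randrange(len(scene))
    moved = replace(scene[i], x=rng.uniform(-3, 3), y=rng.uniform(-3, 3))
    return scene[:i] + [moved] + scene[i + 1:]


def attack(scene, model, rng, budget=2000):
    """Random-search adversary (an assumption, not the paper's trained agent).

    Returns a reconfigured scene on which the model's answer is wrong,
    or None if no such scene is found within the query budget.
    """
    answer = ground_truth(scene)
    current = scene
    for _ in range(budget):
        current = reconfigure(current, rng)
        if model(current) != answer:   # the model is queried only for its output
            return current
    return None


scene = [
    Obj("cube", "red", 0.0, 0.0),
    Obj("sphere", "blue", 1.0, -1.0),
    Obj("cube", "blue", -2.0, 2.0),
]
rng = random.Random(0)
adversarial_scene = attack(scene, biased_model, rng)
```

In this toy setting the biased model answers the original scene correctly but fails once the red cube is moved into its blind region, while a model that genuinely counts would survive every reconfiguration. That contrast is the intuition behind the behavioral test, under the stated simplifications.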
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
