Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?
Monika Wysoczańska, Tom Monnier, Tomasz Trzciński, David Picard

TL;DR
This paper evaluates the effectiveness of off-the-shelf visual features for complex visual reasoning tasks, revealing that object-centric features outperform local features but still fall short of ideal representations.
Contribution
It introduces a novel evaluation protocol for visual representations in reasoning tasks, comparing local and object-centric features using a dedicated attention-based reasoning module.
Findings
Object-centric features better preserve reasoning-critical information.
Off-the-shelf features underperform on complex reasoning despite good proxy task results.
The proposed framework decouples feature extraction from reasoning for evaluation.
Abstract
Recent advances in visual representation learning have made it possible to build an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about the objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks.…
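To make the idea of reasoning over frozen features concrete, here is a minimal sketch of a single-head attention readout. It is not the authors' reasoning module: the function names, the toy feature vectors, and the question query are all hypothetical, and a real setup would learn the query and downstream layers while the visual features stay fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_readout(query, features):
    """Scaled dot-product attention of one query over a set of feature vectors.

    `features` plays the role of frozen visual features (one vector per
    local patch or per object); this function only reads them.
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    # Weighted sum of the feature vectors, one output value per dimension.
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, pooled


# Hypothetical frozen features for three image regions (assumed values).
frozen_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical question embedding; in practice this would be learned.
question_query = [2.0, 0.0]

weights, pooled = attention_readout(question_query, frozen_features)
```

In this toy case the first and third regions receive equal attention weight, and both outweigh the second, because their dot products with the query are identical. Only components such as the query would be trained in an evaluation of this kind, which is what keeps the comparison between feature types focused on the features themselves.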
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
