Lift-the-flap: what, where and when for context reasoning
Mengmi Zhang, Claire Tseng, Karla Montejo, Joseph Kwon, Gabriel Kreiman

TL;DR
This paper investigates spatial and temporal context constraints in visual recognition, combining an online human psychophysics experiment with a recurrent model to infer a hidden target object from the surrounding scene in natural images.
Contribution
It introduces a model that mimics human contextual reasoning by attending to salient regions and dynamically integrating information, achieving human-level accuracy.
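To make the idea of attending to salient regions and integrating information over time concrete, the following is a minimal sketch, not the authors' implementation. It assumes a grid of precomputed per-cell features and a saliency map, uses a plain tanh recurrent update with random (untrained) weights, and never samples the cells hidden by the flap. All names (`contextual_inference`, `W_x`, `W_h`, `W_out`, `target_mask`) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def contextual_inference(saliency, features, target_mask, W_x, W_h, W_out, n_steps=4):
    """Attend to the most salient visible cells, integrate them recurrently,
    and return class probabilities for the hidden target.

    saliency:    (H, W) scores
    features:    (H, W, D) per-cell feature vectors
    target_mask: (H, W) bool, True where the flap hides the target
    """
    sal = saliency.astype(float).copy()
    sal[target_mask] = -np.inf            # hidden cells are never sampled
    h = np.zeros(W_h.shape[0])            # recurrent state
    visited = []
    for _ in range(n_steps):
        idx = np.unravel_index(np.argmax(sal), sal.shape)
        if not np.isfinite(sal[idx]):     # nothing left to sample
            break
        visited.append(idx)
        x = features[idx]                 # feature vector of the attended cell
        h = np.tanh(W_h @ h + W_x @ x)    # integrate the new glimpse
        sal[idx] = -np.inf                # do not revisit this cell
    logits = W_out @ h
    probs = np.exp(logits - logits.max()) # numerically stable softmax
    return probs / probs.sum(), visited

# Toy usage with random inputs and untrained weights (assumed sizes).
rng = np.random.default_rng(0)
H, W, D, K, C = 6, 6, 8, 16, 5
saliency = rng.random((H, W))
features = rng.normal(size=(H, W, D))
target_mask = np.zeros((H, W), dtype=bool)
target_mask[2:4, 2:4] = True              # the flap covers a 2x2 block
W_x = rng.normal(scale=0.3, size=(K, D))
W_h = rng.normal(scale=0.3, size=(K, K))
W_out = rng.normal(scale=0.3, size=(C, K))

probs, visited = contextual_inference(saliency, features, target_mask, W_x, W_h, W_out)
print("attended cells:", visited)
print("class probabilities:", np.round(probs, 3))
```

Because the weights are random, the output probabilities carry no meaning here; the sketch only shows the control flow of sequential sampling followed by a readout.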
Findings
Model attains human-level contextual reasoning accuracy.
Model exhibits human-like sampling behavior.
Learned features are interpretable for contextual reasoning.
Abstract
Context reasoning is critical in a wide variety of applications where current inputs need to be interpreted in the light of previous experience and knowledge. Both spatial and temporal contextual information play a critical role in the domain of visual recognition. Here we investigate spatial constraints (what image features provide contextual information and where they are located), and temporal constraints (when different contextual cues matter) for visual recognition. The task is to reason about the scene context and infer what a target object hidden behind a flap is in a natural image. To tackle this problem, we first describe an online human psychophysics experiment recording active sampling via mouse clicks in lift-the-flap games and identify clicking patterns and features which are diagnostic for high contextual reasoning accuracy. As a proof of the usefulness of these clicking…
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
