Sherlock: Scalable Fact Learning in Images
Mohamed Elhoseiny, Scott Cohen, Walter Chang, Brian Price, Ahmed Elgammal

TL;DR
This paper introduces Sherlock, a scalable framework for understanding and modeling structured facts in images, enabling uniform recognition of objects, attributes, actions, and interactions simultaneously, with improved generalization and retrieval performance.
Contribution
Sherlock presents a unified approach to model diverse visual facts in images, introducing new models and datasets for structured fact learning and demonstrating their effectiveness.
Findings
Structured fact modeling improves visual understanding.
Proposed models outperform baselines in fact retrieval.
A large-scale dataset supports scalable fact learning.
Abstract
We study scalable and uniform understanding of facts in images. Existing visual recognition systems are typically modeled differently for each fact type such as objects, actions, and interactions. We propose a setting where all these facts can be modeled simultaneously with the capacity to understand an unbounded number of facts in a structured way. The training data comes as structured facts in images, including (1) objects (e.g., <boy>), (2) attributes (e.g., <boy, tall>), (3) actions (e.g., <boy, playing>), and (4) interactions (e.g., <boy, riding, a horse>). Each fact has a semantic language view (e.g., <boy, playing>) and a visual view (an image with this fact). We show that learning visual facts in a structured way enables not only a uniform but also generalizable visual understanding. We propose and investigate recent and strong approaches from the multiview…
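As an illustration of the data layout the abstract describes, the four fact types can be held in one record with one, two, or three filled slots, paired with an image for the visual view. This is a minimal sketch only: the names `Fact`, `FactExample`, `modifier`, `obj`, and the image paths are hypothetical and do not come from the paper, and the sketch says nothing about how the paper's models consume such records.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """Language view of a structured fact (hypothetical container, not the paper's code)."""
    subject: str                      # e.g. "boy"
    modifier: Optional[str] = None    # attribute or action, e.g. "tall" or "playing"
    obj: Optional[str] = None         # object of an interaction, e.g. "a horse"

    def __post_init__(self) -> None:
        # An interaction needs something between subject and object.
        if self.obj is not None and self.modifier is None:
            raise ValueError("a fact with an object also needs a modifier")

    def as_tuple(self) -> Tuple[str, ...]:
        """Return only the filled slots, in order."""
        return tuple(p for p in (self.subject, self.modifier, self.obj) if p is not None)

    def arity(self) -> int:
        """1 for an object, 2 for an attribute or action, 3 for an interaction."""
        return len(self.as_tuple())


@dataclass(frozen=True)
class FactExample:
    """One training example: a language view plus a visual view."""
    fact: Fact
    image_path: str  # hypothetical path to an image showing the fact


examples = [
    FactExample(Fact("boy"), "images/0001.jpg"),
    FactExample(Fact("boy", "tall"), "images/0002.jpg"),
    FactExample(Fact("boy", "playing"), "images/0003.jpg"),
    FactExample(Fact("boy", "riding", "a horse"), "images/0004.jpg"),
]

for ex in examples:
    print(ex.fact.arity(), ex.fact.as_tuple(), ex.image_path)
```

Keeping every fact type in a single record, rather than one class per type, mirrors the uniform setting the abstract argues for: the same container covers objects, attributes, actions, and interactions.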
