Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries
Yuke Zhu, Ce Zhang, Christopher Ré, Li Fei-Fei

TL;DR
This paper introduces a scalable multimodal knowledge base system that integrates visual, textual, and structured data to answer visual queries without training new classifiers for each new task.
Contribution
It presents a novel scalable KB construction system capable of handling half a billion variables, enabling flexible and comprehensive visual query answering.
Findings
Achieves results competitive with purpose-built models on standard recognition and retrieval tasks
Builds a KB with half a billion variables and millions of parameters in a few hours
Enhances ability to answer complex visual queries
Abstract
The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, today's vision systems still struggle with visual queries that require deeper reasoning. We propose a knowledge base (KB) framework to handle an assortment of visual queries, without the need to train new classifiers for new tasks. Building such a large-scale multimodal KB presents a major scalability challenge. We cast a large-scale MRF into a KB representation, incorporating visual, textual and structured data, as well as their diverse relations. We introduce a scalable knowledge base construction system that is capable of building a KB with half a billion variables and millions of parameters in a few hours. Our system achieves competitive results compared to purpose-built models on standard recognition and…
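To make the idea of treating an MRF as a queryable knowledge base concrete, here is a minimal toy sketch. It is not the authors' system: the variable names, factor weights, and the `marginal` helper are all hypothetical, and inference is done by brute-force enumeration, which only works for a handful of variables (a KB with half a billion variables would need approximate inference instead). The point it illustrates is that a query becomes a marginal probability over weighted factors, so no new classifier is trained per query.

```python
import itertools
import math

# Hypothetical binary variables; each stands for one fact in a toy KB.
VARIABLES = ["has_object_dog", "has_attribute_furry", "scene_park"]

# Hypothetical factors: (variables involved, weight, predicate).
# A factor contributes its weight to the log-score when its predicate holds.
FACTORS = [
    (("has_object_dog",), 1.5, lambda d: d),  # visual evidence for "dog"
    (("has_object_dog", "has_attribute_furry"), 2.0,
     lambda d, f: (not d) or f),              # soft rule: dog implies furry
    (("has_object_dog", "scene_park"), 0.5,
     lambda d, s: d == s),                    # weak co-occurrence
]


def unnormalized_score(assignment):
    """Return exp(sum of weights of satisfied factors) for a full assignment."""
    total = 0.0
    for names, weight, holds in FACTORS:
        if holds(*(assignment[n] for n in names)):
            total += weight
    return math.exp(total)


def marginal(query_var, evidence=None):
    """P(query_var = True | evidence), by enumerating all free variables."""
    evidence = evidence or {}
    free = [v for v in VARIABLES if v not in evidence]
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([False, True], repeat=len(free)):
        assignment = dict(evidence)
        assignment.update(zip(free, values))
        score = unnormalized_score(assignment)
        denominator += score
        if assignment[query_var]:
            numerator += score
    return numerator / denominator


if __name__ == "__main__":
    # Query: how likely is "furry" once a dog has been observed?
    print(marginal("has_attribute_furry", {"has_object_dog": True}))
    # Same query when no dog is present; the soft rule no longer helps.
    print(marginal("has_attribute_furry", {"has_object_dog": False}))
```

In this sketch, different queries reuse the same factors and differ only in the evidence and the queried variable, which is the property the abstract attributes to the KB framework.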
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
