Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely

TL;DR
This paper introduces WikiScenes, a large-scale multimodal dataset combining images, text, and 3D geometry, and demonstrates its use for learning semantic concepts through a weakly-supervised framework.
Contribution
The work presents WikiScenes, a new dataset of landmark photo collections with captions and hierarchical category names, together with a weakly-supervised framework that connects images, 3D structure, and semantics.
Findings
WikiScenes provides a testbed for multimodal reasoning involving images, text, and 3D geometry.
The framework effectively associates semantic concepts with image pixels and 3D points.
The dataset facilitates learning of semantic concepts over landmark collections.
Abstract
The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections---namely language, e.g., from image captions---has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics---utilizing the strong constraints provided by 3D…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
