Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Xiaoshi Wu; Hadar Averbuch-Elor; Jin Sun; Noah Snavely

arXiv:2108.05863·cs.CV·August 13, 2021

TL;DR

This paper introduces WikiScenes, a large-scale multimodal dataset combining images, text, and 3D geometry, and demonstrates its use for learning semantic concepts through a weakly-supervised framework.

Contribution

The work presents WikiScenes, a new dataset, together with a weakly-supervised method that integrates images, language, and 3D geometry for semantic understanding.

Findings

01. WikiScenes enables multimodal reasoning involving images, text, and 3D models.

02. The framework effectively associates semantic concepts with image pixels and 3D points.

03. The dataset facilitates learning of semantic concepts over landmark collections.

Abstract

The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections, namely language (e.g., from image captions), has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics, utilizing the strong constraints provided by 3D…

Code & Models

Repositories

tgxs002/wikiscenes (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques