Reconstructing and grounding narrated instructional videos in 3D
Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

TL;DR
This paper presents a method for reconstructing objects in 3D from narrated instructional videos and grounding the narrations in the resulting 3D models. It handles appearance variation across videos and diverse natural-language descriptions without manual supervision.
Contribution
The paper introduces a correspondence estimation approach that combines learnt local features with dense flow, a two-step divide-and-conquer 3D reconstruction approach, and an unsupervised method for grounding natural language in 3D models.
Findings
Successfully reconstructs car engines from diverse videos
Effectively associates textual descriptions with 3D objects
Operates without manual supervision
Abstract
Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges by three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide and conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
