Reconstructing and grounding narrated instructional videos in 3D

Dimitri Zhukov; Ignacio Rocco; Ivan Laptev; Josef Sivic; Johannes L. Schönberger; Bugra Tekin; Marc Pollefeys

arXiv:2109.04409·cs.CV·September 13, 2021

Open Access

TL;DR

This paper presents a novel method for reconstructing and grounding objects in 3D from narrated instructional videos, effectively handling appearance variations and natural language diversity without manual supervision.

Contribution

It introduces a combined correspondence estimation technique, a divide-and-conquer 3D reconstruction approach, and an unsupervised method for grounding natural language in 3D models.

Findings

01 Successfully reconstructs car engines from diverse videos

02 Effectively associates textual descriptions with 3D objects

03 Operates without manual supervision

Abstract

Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges with three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide-and-conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques