Step Differences in Instructional Video

Tushar Nagarajan; Lorenzo Torresani

arXiv:2404.16222·cs.CV·July 1, 2024

TL;DR

This paper compares instructional videos by automatically generating paired-video instruction tuning data from HowTo100M and training a video-conditioned language model on it, so that the model can identify differences between videos and reason across multiple videos.

Contribution

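The approach generates large amounts of visual instruction tuning data from pairs of HowTo100M videos, using existing step annotations and accompanying narrations, and trains a video-conditioned language model to reason jointly across multiple raw videos. The model achieves state-of-the-art performance at identifying differences between video pairs and at ranking videos by the severity of those differences.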

Findings

01 Achieves state-of-the-art performance at identifying differences between video pairs

02 Ranks videos by the severity of those differences

03 Shows promising ability to perform general reasoning over multiple videos

Abstract

Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff
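
The data-generation idea in the abstract (pairing videos through their step annotations and narrations) can be illustrated with a minimal sketch. This is not the authors' pipeline: the clip dictionary layout, the `build_pair_samples` helper, and the question wording are all hypothetical, and the sketch only shows how clips from different videos that share a step label might be grouped into paired samples. How the answer text is produced is not described in the abstract and is left out.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_samples(clips):
    """Group clips by step label and emit one sample per cross-video pair.

    Each clip is a dict with hypothetical keys: video_id, step, narration.
    """
    by_step = defaultdict(list)
    for clip in clips:
        by_step[clip["step"]].append(clip)

    samples = []
    for step, group in by_step.items():
        for a, b in combinations(group, 2):
            if a["video_id"] == b["video_id"]:
                continue  # only pair clips that come from different videos
            samples.append({
                "videos": (a["video_id"], b["video_id"]),
                "step": step,
                # Hypothetical prompt wording, for illustration only.
                "question": f"What differs between the two videos for the step '{step}'?",
                "context": (
                    f"Video 1 narration: {a['narration']}\n"
                    f"Video 2 narration: {b['narration']}"
                ),
            })
    return samples


# Toy annotations; real step labels and narrations would come from HowTo100M.
clips = [
    {"video_id": "v1", "step": "whisk eggs", "narration": "I whisk the eggs with a fork."},
    {"video_id": "v2", "step": "whisk eggs", "narration": "Use an electric mixer for the eggs."},
    {"video_id": "v3", "step": "heat pan", "narration": "Heat the pan on medium."},
]
samples = build_pair_samples(clips)
```

In this toy input only `v1` and `v2` share a step, so a single paired sample is produced; `v3` has no partner and is skipped.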

Code & Models

Repositories

facebookresearch/stepdiff (Official)

Taxonomy

Topics: Online and Blended Learning · Communication in Education and Healthcare · Multimedia Communication and Technology