Step Differences in Instructional Video
Tushar Nagarajan, Lorenzo Torresani

TL;DR
This paper introduces a method for comparing instructional videos: it automatically generates training data from pairs of videos and trains a video-conditioned language model to identify differences between videos and reason across multiple videos.
Contribution
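The approach automatically generates visual instruction tuning data from pairs of HowTo100M videos, using existing step annotations and accompanying narrations, and trains a model for cross-video reasoning that outperforms previous methods at identifying differences and ranking videos by their severity.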
Findings
Achieves state-of-the-art performance at identifying differences between video pairs
Achieves state-of-the-art performance at ranking videos by the severity of their differences
Shows promising ability to perform general reasoning over multiple videos
Abstract
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff
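The data-generation idea in the abstract (pairing videos through their shared step annotations and narrations) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Clip` record, its field names, and the `build_pairs` helper are hypothetical, and the example assumes each clip already carries a step label and a narration. How paired narrations are turned into difference descriptions is not stated in the abstract and is left out.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Clip(NamedTuple):
    """Hypothetical record for one annotated clip (field names are assumptions)."""
    video_id: str
    step: str        # step annotation, e.g. "whisk eggs"
    narration: str   # narration text accompanying the clip


def build_pairs(clips):
    """Group clips by step label and pair clips that come from different videos."""
    by_step = defaultdict(list)
    for clip in clips:
        by_step[clip.step].append(clip)

    pairs = []
    for step, group in by_step.items():
        for a, b in combinations(group, 2):
            if a.video_id != b.video_id:  # skip pairs from the same video
                pairs.append({
                    "step": step,
                    "videos": (a.video_id, b.video_id),
                    "narrations": (a.narration, b.narration),
                })
    return pairs


clips = [
    Clip("vid_a", "whisk eggs", "I whisk the eggs with a fork"),
    Clip("vid_b", "whisk eggs", "use an electric mixer for the eggs"),
    Clip("vid_a", "heat pan", "heat the pan on medium"),
]
pairs = build_pairs(clips)
```

Here only the two "whisk eggs" clips form a pair, since they share a step label and come from different videos; the "heat pan" clip has no counterpart.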
Taxonomy
Topics: Online and Blended Learning · Communication in Education and Healthcare · Multimedia Communication and Technology
