TL;DR
DenseStep2M is a large-scale dataset built by an automated, training-free pipeline that extracts high-quality procedural annotations from instructional videos, enabling improved long-term video understanding and downstream task performance.
Contribution
We introduce an automated, training-free pipeline that uses multimodal and large language models to annotate instructional videos, and use it to build DenseStep2M, a large-scale instructional video dataset with temporally grounded procedural steps.
Findings
Models trained on DenseStep2M show improved captioning and localization performance.
The dataset enables robust zero-shot generalization across different video perspectives.
Auto-generated steps align well with human annotations in evaluation.
Abstract
Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support…
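The pipeline stages named in the abstract (shot segmentation, alignment filtering, step generation) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Shot` and `Step` types, the `alignment` score, the `min_alignment` threshold of 0.5, and the `describe` callback (a stand-in for a model call) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    start: float      # shot start time in seconds
    end: float        # shot end time in seconds
    alignment: float  # hypothetical narration/visual alignment score in [0, 1]


@dataclass
class Step:
    start: float  # step start time in seconds
    end: float    # step end time in seconds
    text: str     # instructional step description


def filter_shots(shots: List[Shot], min_alignment: float = 0.5) -> List[Shot]:
    """Keep only shots whose (hypothetical) alignment score meets the threshold."""
    return [s for s in shots if s.alignment >= min_alignment]


def steps_from_shots(shots: List[Shot], describe: Callable[[Shot], str]) -> List[Step]:
    """Turn each kept shot into a temporally grounded step.

    `describe` stands in for whatever model produces the step text.
    """
    return [Step(s.start, s.end, describe(s)) for s in shots]


# Toy input: three shots, the middle one poorly aligned.
shots = [Shot(0.0, 4.0, 0.9), Shot(4.0, 9.5, 0.2), Shot(9.5, 15.0, 0.7)]
kept = filter_shots(shots)
steps = steps_from_shots(kept, lambda s: f"step covering {s.start:.1f}-{s.end:.1f}s")
```

In this toy run the poorly aligned middle shot is dropped, and each remaining shot becomes one step that carries its own start and end time.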
