DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

Mingji Ge; Qirui Chen; Zeqian Li; Weidi Xie

arXiv:2604.26565·cs.CV·April 30, 2026

TL;DR

DenseStep2M is a large-scale dataset of procedural annotations extracted from instructional videos by an automated, training-free pipeline, intended to improve long-term video understanding and downstream task performance.

Contribution

We introduce an automated, training-free pipeline that leverages multimodal models for annotation and evaluation, and use it to build DenseStep2M, a large-scale instructional video dataset.

Findings

01 Models trained on DenseStep2M show improved captioning and localization performance.

02 The dataset enables robust zero-shot generalization across different video perspectives.

03 Auto-generated steps align well with human annotations in evaluation.

Abstract

Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support…
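
The segment, filter, and generate stages described in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Shot` and `Step` types, the `annotate_video` function, the `min_score` threshold, and the one-step-per-shot simplification are hypothetical and not taken from the paper. The two callables stand in for the alignment filter and for the model-based step generation (Qwen2.5-VL and DeepSeek-R1 in the paper), whose actual prompts and criteria are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    """A coherent video shot with the ASR text that overlaps it (hypothetical type)."""
    start: float      # seconds
    end: float        # seconds
    transcript: str   # possibly noisy narration


@dataclass
class Step:
    """A temporally grounded procedural step (hypothetical type)."""
    start: float
    end: float
    text: str


def annotate_video(
    shots: List[Shot],
    alignment_score: Callable[[Shot], float],
    describe_shot: Callable[[Shot], str],
    min_score: float = 0.5,  # assumed threshold, not from the paper
) -> List[Step]:
    """Drop poorly aligned shots, then emit one step per remaining shot."""
    steps: List[Step] = []
    for shot in shots:
        # Filtering stage: skip shots whose narration does not match the visuals.
        if alignment_score(shot) < min_score:
            continue
        # Generation stage: a model call would go here; we only require a string.
        text = describe_shot(shot).strip()
        if text:
            steps.append(Step(shot.start, shot.end, text))
    return steps


# Toy stand-ins for the scorer and the model, for demonstration only.
def toy_score(shot: Shot) -> float:
    return 0.0 if "welcome" in shot.transcript else 1.0


def toy_describe(shot: Shot) -> str:
    return shot.transcript.capitalize() + "."


shots = [
    Shot(0.0, 4.0, "hey guys welcome back to the channel"),
    Shot(4.0, 12.5, "crack two eggs into the bowl"),
    Shot(12.5, 20.0, "whisk until the mixture is smooth"),
]
steps = annotate_video(shots, toy_score, toy_describe)
for step in steps:
    print(f"[{step.start:.1f}s - {step.end:.1f}s] {step.text}")
```

Passing the scorer and the describer as callables keeps the sketch training-free in the same sense as the pipeline: swapping in a different off-the-shelf model changes only the arguments, not the control flow.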

Code & Models

Repositories

https://huggingface.co/datasets/mingjige/DenseStep2M
