DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

Eddison Pham; Prisha Priyadarshini; Adrian Maliackel; Kanishk Bandi; Cristian Meo; Kevin Zhu

arXiv:2510.23907 · cs.CV · December 2, 2025

TL;DR

DynaStride is a novel pipeline that generates coherent scene-level captions for instructional videos by adaptively sampling frames and employing multimodal reasoning, improving temporal coherence and informativeness without manual scene segmentation.

Contribution

The paper introduces DynaStride, a new method that automatically produces scene-level captions in instructional videos using adaptive sampling and multimodal reasoning, eliminating the need for manual scene segmentation.

Findings

01 DynaStride outperforms strong baselines on BLEU, METEOR, BERTScore, and CLIPScore.
02 Captions generated by DynaStride are more temporally coherent and informative.
03 Empirical results demonstrate consistent improvements across multiple evaluation metrics.

Abstract

Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and…
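To make the idea of adaptive frame sampling concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify how the stride is chosen, so the rule used here (halve the stride after a large frame-to-frame change, double it after a small one) is an assumption, as are the names `adaptive_stride_sample`, `mean_abs_diff`, `min_stride`, `max_stride`, and `threshold`. Frames are represented as plain feature vectors for illustration only.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def adaptive_stride_sample(
    frames: Sequence[Sequence[float]],
    min_stride: int = 1,
    max_stride: int = 8,
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of sampled frames (hypothetical stride rule).

    Starts at frame 0 with the largest stride. After each jump, the
    stride is halved if the two sampled frames differ by more than
    `threshold`, and doubled otherwise, clamped to
    [min_stride, max_stride]. The last frame is always included.
    """
    if not frames:
        return []
    picked = [0]
    stride = max_stride
    i = 0
    last = len(frames) - 1
    while i < last:
        j = min(i + stride, last)  # never step past the final frame
        change = mean_abs_diff(frames[i], frames[j])
        if change > threshold:
            # Large visual change: sample more densely from here on.
            stride = max(min_stride, stride // 2)
        else:
            # Little change: sample more sparsely.
            stride = min(max_stride, stride * 2)
        picked.append(j)
        i = j
    return picked
```

With this rule, a static stretch of video is covered by a few widely spaced frames, while a stretch with rapid change is sampled densely once the change is detected. A real pipeline would presumably compare learned visual features rather than raw vectors, which is also an assumption here.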
