TL;DR
DynaStride is a novel pipeline that generates coherent scene-level captions for instructional videos by adaptively sampling frames and employing multimodal reasoning, improving temporal coherence and informativeness without manual scene segmentation.
Contribution
The paper introduces DynaStride, a method that automatically produces scene-level captions for instructional videos using adaptive frame sampling and multimodal reasoning, eliminating the need for manual scene segmentation.
Findings
DynaStride outperforms strong baselines on BLEU, METEOR, BERTScore, and CLIPScore.
Captions generated by DynaStride are more temporally coherent and informative than those produced by the baselines.
Empirical results demonstrate consistent improvements across multiple evaluation metrics.
Abstract
Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and…
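The abstract excerpt mentions adaptive frame sampling but does not spell out the rule. As a rough illustration only, the Python sketch below shows one generic way a sampling stride could adapt to visual change: it widens while frames stay similar and resets when they differ. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the mean-absolute-difference measure, the doubling rule, and the default threshold and stride limits are all hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def adaptive_stride_sample(
    frames: List[Sequence[float]],
    threshold: float = 0.1,  # hypothetical change threshold
    min_stride: int = 1,     # hypothetical smallest step between kept frames
    max_stride: int = 8,     # hypothetical largest step between kept frames
) -> List[int]:
    """Return the indices of the frames kept by a simple adaptive-stride rule.

    Illustrative rule (an assumption, not the authors' algorithm): keep the
    frame we land on, compare it with the previously kept frame, and then
    reset the stride to min_stride if the change exceeds the threshold,
    or double the stride (capped at max_stride) otherwise.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    stride = min_stride
    i = stride
    while i < len(frames):
        change = mean_abs_diff(frames[i], frames[kept[-1]])
        kept.append(i)
        if change > threshold:
            stride = min_stride                    # fast change: sample densely
        else:
            stride = min(stride * 2, max_stride)   # little change: skip ahead
        i += stride
    return kept


# Usage: each "frame" is a tiny stand-in feature vector.
static_scene = [[0.0] for _ in range(20)]          # nothing changes
moving_scene = [[float(k)] for k in range(6)]      # every frame differs
print(adaptive_stride_sample(static_scene))
print(adaptive_stride_sample(moving_scene))
```

With this rule a static scene is sampled sparsely while a scene with frequent transitions is sampled at nearly every frame, which is the general intuition behind capturing key transitions without processing every frame.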