StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Yuhang Hu; Zhenyu Yang; Shihan Wang; Shengsheng Qian; Bin Wen; Fan Yang; Tingting Gao; Changsheng Xu

arXiv:2510.25332 · cs.CV · October 30, 2025

TL;DR

StreamingCoT is a dataset for temporal reasoning in streaming VideoQA. Its annotations track how answers evolve over a video stream and make the reasoning process explicit.

Contribution

The paper presents the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA, along with a framework for explicit multimodal reasoning annotations.

Findings

01. Dataset captures dynamic answer evolution in streaming videos.

02. Framework enables explicit spatiotemporal reasoning paths.

03. Supports development of models with improved temporal and multimodal reasoning.

Abstract

The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion,…
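
The abstract mentions fusing per-second descriptions into semantic segments by similarity. One way to picture that step is the minimal sketch below. It is not the authors' pipeline: the bag-of-words cosine similarity, the greedy adjacent-merge rule, the `threshold` value, and the function names `cosine` and `fuse_segments` are all assumptions made for illustration, since the abstract does not specify the similarity measure or the fusion rule.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse_segments(captions: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Greedily merge adjacent seconds whose captions are similar.

    Hypothetical helper: returns inclusive (start_second, end_second) pairs.
    A new segment starts whenever the similarity between consecutive
    captions drops below `threshold` (an assumed, untuned value).
    """
    if not captions:
        return []
    vectors = [Counter(c.lower().split()) for c in captions]
    segments = []
    start = 0
    for t in range(1, len(captions)):
        if cosine(vectors[t - 1], vectors[t]) < threshold:
            segments.append((start, t - 1))
            start = t
    segments.append((start, len(captions) - 1))
    return segments


# Made-up per-second captions, one per second of video.
captions = [
    "a man opens the fridge",
    "a man opens the fridge door",
    "a dog runs across the yard",
    "a dog runs in the yard",
]
print(fuse_segments(captions))  # [(0, 1), (2, 3)]
```

In practice a learned text or multimodal embedding would likely replace the word-count vectors; the greedy pass is kept here only because it makes the idea of temporally dependent segments easy to follow.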
