CogStream: Context-guided Streaming Video Question Answering

Zicheng Zhao; Kangyu Wang; Shijie Li; Rui Qian; Weiyao Lin; Huabin Liu

arXiv:2506.10516·cs.CV·December 30, 2025

TL;DR

This paper introduces CogStream, a streaming video reasoning task in which models must select the most relevant historical context before answering. The task is supported by a new dataset and a baseline model, with the aim of improving efficiency and accuracy in real-world scenarios.

Contribution

The paper proposes the CogStream task, builds a densely annotated dataset for it, and develops CogReasoner, a model that improves streaming video question answering by selecting relevant context.

Findings

01. CogReasoner effectively handles streaming video reasoning tasks.
02. The dataset enables detailed evaluation of context relevance.
03. Relevance-guided context selection improves model performance.

Abstract

Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline.…
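The core idea in the abstract, keeping only the historical context most relevant to the current question instead of feeding everything to the model, can be illustrated with a minimal sketch. This is not CogReasoner's actual selection mechanism, which the truncated abstract does not describe. It assumes each past question-answer pair already has an embedding vector and ranks them by cosine similarity to the current question. The function names, the toy three-dimensional embeddings, and the `k` parameter are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_relevant_context(
    query: Sequence[float],
    history: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k history items most similar to the query.

    The selected items are returned in their original (temporal) order,
    so the downstream model still sees the stream chronologically.
    """
    scored = [(cosine(query, emb), idx) for idx, (_, emb) in enumerate(history)]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    keep = sorted(idx for _, idx in top)
    return [history[i][0] for i in keep]


# Toy history of earlier QA pairs with made-up embeddings (assumption).
history = [
    ("qa_1: person enters kitchen", [1.0, 0.0, 0.0]),
    ("qa_2: dog runs outside", [0.0, 1.0, 0.0]),
    ("qa_3: person picks up a knife", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.05, 0.0]  # embedding of the current question (assumption)

selected = select_relevant_context(query, history, k=2)
print(selected)
```

Here the unrelated item (`qa_2`) is dropped, which reflects the two motivations the abstract gives: less visual and textual context to process, and fewer irrelevant details to distract the model.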

Code & Models

Datasets

SII-KYW/CogStream (dataset, 6.7k downloads)

Videos

CogStream: Context-guided Streaming Video Question Answering

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling