SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu

TL;DR
SVBench is a new benchmark designed to evaluate large vision-language models on their ability to understand and reason over long streaming videos through multi-turn question-answering chains, highlighting current limitations and progress.
Contribution
Introduces SVBench, a comprehensive benchmark with temporal multi-turn QA chains for streaming video understanding, along with a new StreamingChat model that outperforms open-source LVLMs.
Findings
GPT-4o outperforms other models in streaming video tasks
Most open-source LVLMs struggle with long-context streaming videos
StreamingChat significantly outperforms open-source LVLMs on SVBench
Abstract
Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive…
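To make the idea of a temporal multi-turn QA chain concrete, here is a minimal sketch of how one chain might be represented. The class names, field names, and the `history_before` helper are assumptions made for illustration and are not taken from the SVBench release. The only figures drawn from the abstract are the 49,979 QA pairs and 1,353 videos, which work out to roughly 37 QA pairs per video.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """One question-answer turn (hypothetical schema, not the official one)."""
    question: str
    answer: str
    timestamp_s: float  # assumed: video time, in seconds, that the turn refers to


@dataclass
class QAChain:
    """A series of consecutive QA turns tied to one segment of a streaming video."""
    video_id: str
    segment_start_s: float
    segment_end_s: float
    turns: List[QAPair] = field(default_factory=list)

    def history_before(self, turn_index: int) -> List[QAPair]:
        """Return the earlier turns a model would see as dialogue context."""
        return self.turns[:turn_index]


# Dataset-level figures quoted in the abstract.
TOTAL_QA_PAIRS = 49_979
TOTAL_VIDEOS = 1_353
avg_pairs_per_video = TOTAL_QA_PAIRS / TOTAL_VIDEOS  # roughly 37
```

In a multi-turn evaluation of this kind, the model answering turn `i` would receive `chain.history_before(i)` as context in addition to the video seen so far, which is what separates it from single-instance QA.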
Peer Reviews
Decision: ICLR 2025 Spotlight
- Novel Technical Contribution: The paper introduces a semi-automated pipeline that combines LLM-assisted generation with human verification to create temporal multi-turn QA chains, representing a methodologically sound approach to dataset creation.
- Comprehensive Empirical Validation: The evaluation spans 14 different models (both open and closed-source), uses multiple metrics (METEOR, GPT4-Score, etc.), and includes detailed ablation studies comparing single-instance vs. multi-turn QA perfor…
The paper's heavy reliance on LLMs for evaluation is a significant methodological concern. While the authors use GPT-4 to assess models' performance across multiple dimensions (semantic accuracy, contextual coherence, etc.), there is no validation of whether these automated scores align with human judgments. Because LLMs are used both to generate the annotations and to evaluate the results, there is a circular dependency that could mask real limitations or biases in the evaluation process. Wit…
S1: The paper introduces SVBench, a benchmark explicitly designed for evaluating LVLMs in long-context streaming video understanding. I believe this fills a notable gap in existing benchmarks, which typically focus on isolated text inputs rather than sustained temporal reasoning across video streams. The comparison between the benchmarks also shows the advantages of SVBench.
S2: The QA pairs in this work are annotated semi-automatically, making it a large-scale dataset with high-quality annotat…
W1: The paper does not include a comparison with human performance. Incorporating such a comparison would provide valuable insights into the gap between current models and human capabilities in long-context streaming video understanding.
W2: The paper does not analyze the impact of language model size on performance. Considering that models like InternVL2 have versions with 1B, 2B, 4B, 8B, 26B, 40B, and 72B parameters, and Video-LLaMA2 also has 72B versions, expanding experiments to include th…
1. The proposed SVBench fills the gap between video benchmarks and streaming video understanding. In the real world, streaming video is a more challenging data form, so this benchmark has very important practical significance.
2. The authors proposed a semi-automatic annotation process and integrated multiple types of video data, which maintained both the diversity of the benchmark and the accuracy of the annotation information, and prevented hallucinations from affecting the benchmark r…
1. Although the authors illustrate the video types and the diversity of annotation information in SVBench through visualization results, they do not describe the distribution of video lengths, which may be important for a benchmark.
2. The training method of the StreamingChat architecture proposed by the authors is similar to the recently proposed streaming video understanding model Video-online. I hope the authors can explain the difference b…
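One review above asks whether the GPT-4-based scores align with human judgments. A common way to check this kind of agreement is a rank correlation between the automated scores and human ratings on the same responses. The sketch below computes Spearman's rank correlation in plain Python. It is an illustration of the general technique, not a procedure described in the paper, and the function names are assumptions.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

A value near 1 would suggest the automated judge orders responses much as human raters do, while a value near 0 would support the reviewer's concern. The scores fed to `spearman` would have to come from an actual human study, which the review says the paper does not report.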
Taxonomy
Topics: Speech and dialogue systems · Video Analysis and Summarization · Multimedia Communication and Technology
Methods: Streaming Module
