RIVER: A Real-Time Interaction Benchmark for Video LLMs
Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

TL;DR
This paper introduces RIVER, a benchmark for evaluating real-time video understanding in large language models, highlighting current limitations and proposing improvements for online interaction capabilities.
Contribution
The paper presents RIVER, a novel benchmark with tasks mimicking interactive video dialogue, and proposes a method to enhance models' real-time interaction abilities.
Findings
Offline models excel in single QA tasks but struggle in real-time interaction.
Models lack long-term memory and future perception in online video understanding.
Proposed improvements enable more flexible real-time user interactions.
Abstract
The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in…
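The three task families named in the abstract can be read as three timing relationships between when a user asks and when the relevant visual evidence appears. The following is a minimal sketch under that reading, not the benchmark's actual schema: the `Item` fields, the `task_type` helper, and the `live_tol` tolerance are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical fields; the real annotation format may differ.
    query_time: float  # seconds into the stream when the user asks
    cue_time: float    # seconds into the stream when the evidence appears


def task_type(item: Item, live_tol: float = 1.0) -> str:
    """Classify an item by the sign of the gap between cue and query.

    live_tol is an assumed tolerance (in seconds) for treating the cue
    as happening "now" relative to the query.
    """
    gap = item.cue_time - item.query_time
    if abs(gap) <= live_tol:
        return "live_perception"
    # Evidence before the query means recall; after it means anticipation.
    return "retrospective_memory" if gap < 0 else "proactive_anticipation"
```

For example, a question asked at 60 s about something shown at 10 s would fall under retrospective memory in this sketch, while a question asked at 10 s about an event that occurs at 60 s would fall under proactive anticipation.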
Peer Reviews
Decision: ICLR 2026 Poster
• Clear, timely problem: shifts evaluation from offline video QA to interactive, streaming settings with explicit timing of query/cue/response.
• Three-facet design: jointly measures recall, live perception, and anticipation, and ties performance to the temporal gap Δ (forgetting/anticipation curves).
• Protocol precision: items carry exact timestamps; “instant” vs “stream” anticipation reflects real interaction patterns (trigger vs continuous narration).
• Methodological baselines: show
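The forgetting and anticipation curves mentioned in this review relate accuracy to the temporal gap Δ between query and cue. One way such a curve could be computed is to bucket items by |Δ| and average correctness per bucket. This is a sketch under assumptions: the `accuracy_by_gap` function, the `(delta_seconds, is_correct)` record layout, and the 30-second default bin width are illustrative choices, not details from the paper.

```python
from collections import defaultdict


def accuracy_by_gap(records, bin_width=30.0):
    """Group (delta_seconds, is_correct) records into |delta| bins.

    Returns a dict mapping each bin's start (in seconds) to the mean
    accuracy of the items whose |delta| falls in that bin.
    """
    totals = defaultdict(lambda: [0, 0])  # bin_start -> [hits, count]
    for delta, correct in records:
        start = (abs(delta) // bin_width) * bin_width
        totals[start][0] += int(correct)
        totals[start][1] += 1
    return {start: hits / n for start, (hits, n) in sorted(totals.items())}
```

Plotting the returned values against the bin starts would give a curve of the kind the reviewer describes; a downward slope on retrospective items would indicate forgetting as the gap grows.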
• Data novelty & provenance: benchmark reuses existing datasets; novelty is primarily the reconstruction into an online protocol. That’s valuable, but the paper should be transparent about how much is new annotation vs relabeling/retiming, and quantify human effort & agreement.
• Mixed evaluation formats: retro-memory remains MCQ, while other parts use open-ended judged by Qwen2.5-72B. This mixture complicates cross-task comparability and may inherit LLM-judge biases (version drift, style sensi
1. The paper creates an accurate online task formalization with retro/live/pro-anticipation split and timing semantics. In addition, the windowed formulation ties the accuracy to when the answer is generated.
2. Curated and broad construction across long videos, equipped with explicit filtering to mitigate language-only priors and ambiguous items.
3. Operational online protocol that makes many offline models evaluable in real time, which enables informative cross-family comparisons.
4. Comparati
1. OE scoring depends on Qwen2.5-72B; prompts, thresholds, and sensitivity analyses are not deeply reported. The Res Acc window width and tolerance are not exhaustively justified; in addition, user-centric latency–utility tradeoffs are not evaluated either.
2. From my perspective, even though the pipeline filters items that are language-answerable, deep LLM participation risks unintended stylistic mimicry and inherent bias. More transparent human IAA and QA metrics would be quite helpful.
3. T
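The reviews refer to a windowed response accuracy ("Res Acc") that depends on when the answer is produced, and ask how sensitive it is to the window width. The sketch below shows one plausible form of such a metric: a response counts only if its content is correct and it arrives within `window` seconds of the cue. The function name, the tuple layout, the symmetric window, and the 2-second default are assumptions for illustration; the paper's actual definition may differ.

```python
def response_accuracy(items, window=2.0):
    """Fraction of items answered correctly and on time.

    items: iterable of (cue_time, response_time, content_correct), where
    response_time is None if the model never responded. A response is
    on time when |response_time - cue_time| <= window (assumed rule).
    """
    items = list(items)
    if not items:
        return 0.0
    hits = 0
    for cue_time, response_time, content_correct in items:
        on_time = (
            response_time is not None
            and abs(response_time - cue_time) <= window
        )
        hits += int(on_time and content_correct)
    return hits / len(items)
```

Evaluating the same predictions at several `window` values is a simple way to produce the kind of sensitivity analysis the reviewer requests: a score that changes sharply between nearby widths suggests the reported number depends heavily on the chosen tolerance.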
1. The paper fills a clear gap by introducing RIVER Bench for real-time video interaction, moving beyond the traditional offline paradigm. Its design with Retrospective Memory, Live Perception, and Proactive Anticipation tasks realistically mimics dynamic, interactive scenarios.
2. The dataset is diverse and well-annotated, combining multiple video sources with fine-grained temporal labeling and strong quality control, ensuring high reliability.
3. The experiments are comprehensive, covering v
1. RIVER Bench only supports video-text interaction, not including audio, which is crucial for real-time tasks like voice navigation or human-robot interaction. While this is mentioned as a limitation, it would be helpful to test and report ASR performance, which would increase the benchmark’s practical value.
2. The data primarily comes from Ego4D-Narration, focusing on simple, static tasks like desk operations and furniture organization, and lacks more complex dynamic scenarios such as traffi
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
