RIVER: A Real-Time Interaction Benchmark for Video LLMs

Yansong Shi; Qingsong Zhao; Tianxiang Jiang; Xiangyu Zeng; Yi Wang; Limin Wang

arXiv:2603.03985 · cs.CV · March 5, 2026

TL;DR

This paper introduces RIVER, a benchmark for evaluating real-time video understanding in large language models, highlighting current limitations and proposing improvements for online interaction capabilities.

Contribution

The paper presents RIVER, a novel benchmark with tasks mimicking interactive video dialogue, and proposes a method to enhance models' real-time interaction abilities.

Findings

01. Offline models excel in single QA tasks but struggle in real-time interaction.

02. Models lack long-term memory and future perception in online video understanding.

03. Proposed improvements enable more flexible real-time user interactions.

Abstract

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in…
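As a rough illustration of the three task families named in the abstract, the sketch below sorts an item by where its visual evidence lies relative to the moment the user asks. This is an assumed reading, not the benchmark's actual schema: the `Item` class, its field names, and the classification rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical benchmark item; field names are assumptions."""
    query_time: float  # second at which the user asks the question
    cue_start: float   # second at which the relevant evidence begins
    cue_end: float     # second at which the relevant evidence ends


def task_type(item: Item) -> str:
    """Classify an item by the position of its evidence relative to the query."""
    if item.cue_end < item.query_time:
        return "retrospective_memory"    # evidence is already in the past
    if item.cue_start > item.query_time:
        return "proactive_anticipation"  # evidence has not happened yet
    return "live_perception"             # evidence overlaps the query moment


def temporal_gap(item: Item) -> float:
    """Distance in seconds between the query and the nearest edge of the evidence."""
    if item.cue_end < item.query_time:
        return item.query_time - item.cue_end
    if item.cue_start > item.query_time:
        return item.cue_start - item.query_time
    return 0.0
```

Under this reading, an item asked at 60 s about evidence shown at 10 to 20 s would be a retrospective item with a 40 s gap.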

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

• Clear, timely problem: shifts evaluation from offline video QA to interactive, streaming settings with explicit timing of query/cue/response.
• Three-facet design: jointly measures recall, live perception, and anticipation, and ties performance to the temporal gap Δ (forgetting/anticipation curves).
• Protocol precision: items carry exact timestamps; “instant” vs “stream” anticipation reflects real interaction patterns (trigger vs continuous narration).
• Methodological baselines: show
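One way to picture the forgetting and anticipation curves this reviewer refers to is to bucket per-item correctness by the temporal gap and report accuracy per bucket. The sketch below is only an illustration: the function name, the `(gap_seconds, is_correct)` record layout, and the 30-second bin width are assumptions, not values taken from the paper.

```python
from collections import defaultdict


def accuracy_by_gap(records, bin_width=30.0):
    """Group (gap_seconds, is_correct) records into fixed-width gap bins.

    Returns a dict mapping each bin's lower edge (seconds) to the accuracy
    of the items falling in that bin, ordered by increasing gap.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gap, correct in records:
        bin_index = int(gap // bin_width)
        totals[bin_index] += 1
        hits[bin_index] += int(correct)
    return {i * bin_width: hits[i] / totals[i] for i in sorted(totals)}


# Toy records (invented for illustration): accuracy drops as the gap grows.
toy = [(5, True), (20, True), (40, True), (50, False), (100, False)]
curve = accuracy_by_gap(toy)
```

Plotting `curve` with the gap on the x-axis would give a curve of the kind described; a falling line would indicate forgetting for retrospective items or weaker foresight for anticipation items.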

Weaknesses

• Data novelty & provenance: benchmark reuses existing datasets; novelty is primarily the reconstruction into an online protocol. That’s valuable, but the paper should be transparent about how much is new annotation vs relabeling/retiming, and quantify human effort & agreement.
• Mixed evaluation formats: retro-memory remains MCQ, while other parts use open-ended judged by Qwen2.5-72B. This mixture complicates cross-task comparability and may inherit LLM-judge biases (version drift, style sensi

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper creates an accurate online task formalization with retro/live/pro-anticipation split and timing semantics. In addition, the windowed formulation ties the accuracy to when the answer is generated.
2. Curated and broad construction across long videos, equipped with explicit filtering to mitigate language-only priors and ambiguous items.
3. Operational online protocol that makes many offline models evaluable in real time, which enables informative cross-family comparisons.
4. Comparati

Weaknesses

1. OE scoring depends on Qwen2.5-72B; prompts, thresholds, and sensitivity analyses are not reported in depth. The Res Acc window width and tolerance are not exhaustively justified; in addition, user-centric latency–utility tradeoffs are not evaluated.
2. From my perspective, even though the pipeline filters items that are language-answerable, deep LLM participation risks unintended stylistic mimicry and inherent bias. More transparent human IAA and QA metrics would be quite helpful.
3. T

Reviewer 03 · Rating 4 · Confidence 3
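To make the concern about window width and tolerance concrete, the sketch below shows one plausible form of a windowed response-timing check. It is not the paper's Res Acc definition, which is not reproduced on this page: the function names, the asymmetric window, and the default values of 2.0 s and 0.5 s are all hypothetical placeholders.

```python
def response_in_window(response_time, cue_time, window=2.0, tolerance=0.5):
    """True if the response falls in [cue_time - tolerance, cue_time + window]."""
    return cue_time - tolerance <= response_time <= cue_time + window


def response_accuracy(pairs, window=2.0, tolerance=0.5):
    """Fraction of (response_time, cue_time) pairs answered inside the window.

    A response_time of None means the model never responded and counts as a miss.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        response_in_window(r, c, window, tolerance)
        for r, c in pairs
        if r is not None
    )
    return hits / len(pairs)


# Toy timings (invented): two on-time responses, one late, one missing.
toy_pairs = [(10.5, 10.0), (13.0, 10.0), (None, 20.0), (19.6, 20.0)]
score = response_accuracy(toy_pairs)
```

Re-running `response_accuracy` over a grid of `window` and `tolerance` values would be one simple form of the sensitivity analysis the reviewer asks for.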

Strengths

1. The paper fills a clear gap by introducing RIVER Bench for real-time video interaction, moving beyond the traditional offline paradigm. Its design with Retrospective Memory, Live Perception, and Proactive Anticipation tasks realistically mimics dynamic, interactive scenarios.
2. The dataset is diverse and well-annotated, combining multiple video sources with fine-grained temporal labeling and strong quality control, ensuring high reliability.
3. The experiments are comprehensive, covering v

Weaknesses

1. RIVER Bench only supports video-text interaction, not including audio, which is crucial for real-time tasks like voice navigation or human-robot interaction. While this is mentioned as a limitation, it would be helpful to test and report ASR performance, which would increase the benchmark’s practical value.
2. The data primarily comes from Ego4D-Narration, focusing on simple, static tasks like desk operations and furniture organization, and lacks more complex dynamic scenarios such as traffi

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition