VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Pavan Kumar Anasosalu Vasu; Cem Koc; Fartash Faghri; Chun-Liang Li; Bo Feng; Zhengfeng Lai; Meng Cao; Oncel Tuzel; Hadi Pouransari

arXiv:2604.07634 · cs.CV · May 7, 2026

TL;DR

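VSAS-Bench is a framework and benchmark for evaluating real-time visual streaming assistant models. Beyond video-understanding accuracy, it measures proactiveness (how timely a model's responses are) and consistency (how robust its responses remain over time), using temporally dense annotations across diverse input domains and task types.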

Contribution

VSAS-Bench provides a benchmark with standardized evaluation protocols and over 18,000 temporally dense annotations, enabling large-scale assessment of streaming VLMs' capabilities and trade-offs.

Findings

01

Conventional VLMs can be adapted to streaming without extra training.

02

Adapted models outperform recent streaming VLMs, e.g., Qwen3-VL-4B beats Dispider by 3%.

03

The framework reveals key factors affecting accuracy-latency trade-offs.

Abstract

Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce…
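To make the streaming setup in the abstract concrete, here is a minimal Python sketch, not the paper's protocol or metric definitions. It assumes a hypothetical offline model exposed as a plain callable that is re-queried on a sliding window of recent frames, and it uses two toy measurements chosen purely for illustration: the delay between an event and the first correct response (a stand-in for proactiveness) and the fraction of consecutive responses that change (a stand-in for consistency). All names (`stream_responses`, `response_delay`, `flip_rate`, `toy_model`) are assumptions introduced here.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in: a "frame" is just a label string in this toy example.
Frame = str
Model = Callable[[str, List[Frame]], str]


def stream_responses(frames: Sequence[Frame], prompt: str, model: Model,
                     window: int = 3) -> List[Tuple[int, str]]:
    """Query an offline-style model once per incoming frame.

    The model only ever sees the last `window` frames, which mimics
    adapting a conventional model to a stream without extra training.
    """
    buffer: deque = deque(maxlen=window)
    responses = []
    for t, frame in enumerate(frames):
        buffer.append(frame)
        responses.append((t, model(prompt, list(buffer))))
    return responses


def response_delay(responses: Sequence[Tuple[int, str]], target: str,
                   event_time: int) -> Optional[int]:
    """Toy timeliness measure: frames elapsed between the event and the
    first response equal to `target`. Returns None if it never appears."""
    for t, answer in responses:
        if t >= event_time and answer == target:
            return t - event_time
    return None


def flip_rate(responses: Sequence[Tuple[int, str]]) -> float:
    """Toy stability measure: fraction of consecutive response pairs
    whose answers differ (lower means more stable over time)."""
    answers = [answer for _, answer in responses]
    if len(answers) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(answers, answers[1:]))
    return flips / (len(answers) - 1)


def toy_model(prompt: str, frames: List[Frame]) -> str:
    """Hypothetical model: answers "open" once at least two frames in
    its window show an open door; the prompt is ignored."""
    return "open" if frames.count("open") >= 2 else "closed"


# The door opens at frame index 2 in this made-up stream.
frames = ["closed", "closed", "open", "open", "open", "closed"]
responses = stream_responses(frames, "Is the door open?", toy_model, window=3)
delay = response_delay(responses, target="open", event_time=2)
stability = flip_rate(responses)
```

In this toy stream the model needs two "open" frames in its window before it changes its answer, so a larger window or a stricter threshold would trade a longer delay for fewer answer flips. That is one simple way to picture the kind of accuracy-latency trade-off the benchmark is designed to expose.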

Code & Models

Repositories

apple/ml-vsas-bench (GitHub)