StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Yifei Wang; Zhenkai Li; Tianwen Qian; Huanran Zheng; Zheng Wang; Yuqian Fu; Xiaoling Wang

arXiv:2512.04451·cs.CV·December 5, 2025

TL;DR

StreamEQA is a novel benchmark for evaluating streaming video question answering in embodied scenarios, emphasizing perception, interaction, and planning across different temporal contexts to advance embodied intelligence.

Contribution

We introduce StreamEQA, the first benchmark specifically designed for streaming video question answering in embodied environments, with comprehensive tasks and evaluation protocols.

Findings

01. Existing models perform poorly on streaming embodied tasks

02. StreamEQA covers 42 tasks with 21K questions and timestamps

03. Evaluation reveals gaps in current video-LLMs' streaming understanding

Abstract

As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions,…
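To make the setup concrete, here is a minimal sketch of how one timestamped streaming question might be represented. It is not the benchmark's actual data format. The three embodied levels come from the abstract. The record layout, the field names, the `visible_frames` helper, and the rule that a model sees only frames at or before the query time are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class EmbodiedLevel(Enum):
    # The three embodied levels named in the abstract.
    PERCEPTION = "perception"
    INTERACTION = "interaction"
    PLANNING = "planning"


@dataclass(frozen=True)
class StreamingQuestion:
    # Hypothetical record layout; field names are assumptions, not the paper's schema.
    question: str
    level: EmbodiedLevel
    query_time_s: float  # point in the stream at which the question is asked


def visible_frames(frame_times_s, query_time_s):
    """Return indices of frames observed at or before the query time.

    Assumption: in a streaming setting the model cannot look at frames
    that arrive after the question is asked.
    """
    return [i for i, t in enumerate(frame_times_s) if t <= query_time_s]


# Example stream sampled every 0.5 seconds (made-up values).
frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
q = StreamingQuestion("What is the agent holding?", EmbodiedLevel.PERCEPTION, 1.2)
seen = visible_frames(frame_times, q.query_time_s)
print(seen)  # [0, 1, 2]
```

Attaching the timestamp to the question rather than to the video keeps the same clip reusable for several questions asked at different moments.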

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition