TL;DR
StreamGaze is a new benchmark that evaluates how well Multimodal Large Language Models (MLLMs) use gaze signals for temporal reasoning and proactive understanding in streaming videos, and it highlights the limitations of current models.
Contribution
This work presents the first benchmark specifically designed to assess gaze-guided reasoning by MLLMs in streaming video understanding, together with a novel QA generation pipeline.
Findings
MLLMs lag behind humans in gaze-based streaming video tasks.
Gaze-guided reasoning reveals key limitations in current models.
Analysis suggests directions for improving gaze utilization in models.
Abstract
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we…
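The streaming constraint described in the abstract (answering from past and currently observed frames only) can be illustrated with a minimal sketch. The names `GazeFrame` and `visible_context`, and the normalized gaze coordinates, are hypothetical choices made for illustration; they are not taken from the StreamGaze code or data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GazeFrame:
    """Hypothetical record: one video frame paired with a gaze point."""
    timestamp: float              # seconds from the start of the stream
    gaze_xy: Tuple[float, float]  # assumed normalized gaze point in [0, 1] x [0, 1]


def visible_context(frames: List[GazeFrame], query_time: float) -> List[GazeFrame]:
    """Keep only frames observed at or before query_time, so no future frame leaks in."""
    return [f for f in frames if f.timestamp <= query_time]


# Toy stream: gaze drifts from the center toward the upper right.
stream = [
    GazeFrame(0.0, (0.50, 0.50)),
    GazeFrame(1.0, (0.62, 0.48)),
    GazeFrame(2.0, (0.80, 0.30)),
    GazeFrame(3.0, (0.81, 0.29)),
]

# A question asked at t = 2.0 may only use the first three frames.
context = visible_context(stream, query_time=2.0)
print([f.timestamp for f in context])  # [0.0, 1.0, 2.0]
```

Under this framing, a past or present task would query what is in `context`, while a proactive task would ask the model to anticipate something about frames that `visible_context` excludes.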
