PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

Xudong Lu; Huankang Guan; Yang Bo; Jinpeng Chen; Xintong Guo; Shuhan Li; Fang Liu; Peiwen Sun; Xueying Li; Wei Zhang; Xue Yang; Rui Liu; Hongsheng Li

arXiv:2601.22575 · cs.CV · February 2, 2026

TL;DR

PhoStream is a mobile-centric streaming benchmark for evaluating multimodal large language models on continuous audio-visual reasoning in real-world scenarios. It shows that current models handle Instant and Backward tasks well but struggle with Forward tasks, where they tend to respond before the relevant cues are available.

Contribution

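The paper presents the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios. The benchmark is built with an Automated Generative Pipeline backed by human verification, and models are evaluated with an Online Inference Pipeline and LLM-as-a-Judge scoring of open-ended responses.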

Findings

01  Models excel in Instant and Backward tasks but struggle with Forward tasks.

02  Current models tend to respond prematurely before visual and audio cues are available.

03  There is a fundamental challenge in models deciding when to speak during streaming.
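The "premature response" finding can be made concrete with a small sketch. It is not the paper's evaluation code. It assumes, hypothetically, that each test item records the time at which the needed audio-visual cue appears (`cue_time`) and the time at which the model answered (`response_time`, or `None` if it stayed silent). The function names and labels are illustrative only.

```python
from typing import List, Optional, Tuple


def classify_response_timing(response_time: Optional[float], cue_time: float) -> str:
    """Label when a model spoke relative to the moment the needed cue appears.

    Hypothetical labels, not taken from the paper:
      "no_response" - the model never answered
      "premature"   - the model answered before the cue was available
      "after_cue"   - the model answered at or after the cue
    """
    if response_time is None:
        return "no_response"
    if response_time < cue_time:
        return "premature"
    return "after_cue"


def premature_rate(samples: List[Tuple[Optional[float], float]]) -> float:
    """Fraction of (response_time, cue_time) pairs labeled premature."""
    if not samples:
        return 0.0
    labels = [classify_response_timing(r, c) for r, c in samples]
    return labels.count("premature") / len(labels)
```

Under these assumptions, a model that often answers Forward-style questions immediately would show a high `premature_rate`, which is one simple way to quantify the "when to speak" problem described above.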

Abstract

Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · AI in Service Interactions