PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
Xudong Lu, Huankang Guan, Yang Bo, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Peiwen Sun, Xueying Li, Wei Zhang, Xue Yang, Rui Liu, Hongsheng Li

TL;DR
PhoStream introduces a comprehensive mobile-centric streaming benchmark for evaluating multimodal large language models in real-world scenarios, revealing their strengths and limitations in continuous audio-visual reasoning tasks.
Contribution
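PhoStream is presented as the first mobile-centric streaming benchmark to unify on-screen and off-screen scenarios. It is built with an automated generative pipeline backed by human verification, and models are evaluated through an online inference pipeline with LLM-as-a-Judge scoring of open-ended responses.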
Findings
Models excel in Instant and Backward tasks but struggle with Forward tasks.
Current models tend to respond prematurely, before the relevant visual and audio cues have appeared.
Deciding when to speak during streaming remains a fundamental challenge for these models.
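The "premature response" finding can be made concrete with a small timing check. The sketch below is illustrative only and is not PhoStream's scoring method: it assumes that each forward-looking question carries a timestamp for when it is asked and a timestamp for when the needed evidence first appears in the stream. The names `ForwardQuery`, `ask_time`, `cue_time`, and `classify_timing` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwardQuery:
    # Hypothetical annotation fields, not taken from the benchmark.
    ask_time: float  # seconds into the stream when the question is posed
    cue_time: float  # seconds when the needed audio-visual cue first appears


def classify_timing(query: ForwardQuery, response_time: Optional[float]) -> str:
    """Label a response by when it was produced relative to the cue."""
    if response_time is None:
        # The model never spoke for this query.
        return "no_response"
    if response_time < query.ask_time:
        raise ValueError("response cannot precede the question")
    if response_time < query.cue_time:
        # The model answered before the evidence was available.
        return "premature"
    return "on_or_after_cue"


if __name__ == "__main__":
    q = ForwardQuery(ask_time=10.0, cue_time=25.0)
    for t in (12.0, 25.0, None):
        print(t, classify_timing(q, t))
```

Under these assumptions, a model that answers at 12 s a question whose evidence only arrives at 25 s is labeled premature, which is the failure pattern the findings describe.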
Abstract
Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · AI in Service Interactions
