Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
Jiangkai Wu, Zhiyuan Ren, Liming Liu, Xinggong Zhang

TL;DR
This paper introduces AI Video Chat, highlighting its unique challenges in low-latency video streaming for real-time human-AI interaction, and proposes a context-aware streaming method alongside a new benchmark for evaluation.
Contribution
It presents DeViBench, the first benchmark for evaluating how well MLLMs understand video under degraded conditions, and proposes a context-aware streaming approach that reduces bitrate while maintaining MLLM accuracy.
Findings
Ultra-low bitrate is crucial for low latency in AI Video Chat.
Context-aware video streaming significantly reduces bitrate without sacrificing MLLM accuracy.
DeViBench provides a new standard for evaluating degraded video understanding in AI chat systems.
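To make the link between bitrate and latency concrete, here is a minimal back-of-the-envelope sketch. It is not the authors' measurement method; it only models the time needed to push one average-sized frame through a bottleneck link, and ignores queuing, loss, and retransmission. The function names and all numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
def frame_transmission_ms(bitrate_kbps: float, fps: float, bandwidth_kbps: float) -> float:
    """Time (ms) to send one average-sized frame over a bottleneck link.

    Simplified model (assumption): frame size = bitrate / fps, and the link
    drains at a constant bandwidth with no queuing or retransmission.
    """
    frame_kbits = bitrate_kbps / fps
    return frame_kbits * 1000.0 / bandwidth_kbps


def streaming_budget_ms(target_response_ms: float, inference_ms: float) -> float:
    """Time (ms) left for video streaming once MLLM inference is accounted for."""
    return target_response_ms - inference_ms


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the trade-off.
    budget = streaming_budget_ms(target_response_ms=300.0, inference_ms=250.0)
    for bitrate in (2000.0, 200.0):
        delay = frame_transmission_ms(bitrate, fps=25.0, bandwidth_kbps=1000.0)
        verdict = "fits" if delay <= budget else "exceeds"
        print(f"{bitrate:.0f} kbps: {delay:.1f} ms per frame ({verdict} {budget:.0f} ms budget)")
```

Under this simplified model, per-frame delay scales linearly with bitrate, so a tenfold bitrate reduction shrinks the transmission time tenfold on the same link.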
Abstract
AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To…
