StreamChat: Chatting with Streaming Video

Jihao Liu; Zhiding Yu; Shiyi Lan; Shihao Wang; Rongyao Fang; Jan Kautz; Hongsheng Li; Jose M. Alvarez

arXiv:2412.08646·cs.CV·April 1, 2025

Open Access

TL;DR

StreamChat introduces a dynamic, real-time visual context updating mechanism for Large Multimodal Models, significantly improving streaming video interaction capabilities with a novel architecture and dataset.

Contribution

It proposes a new approach for streaming video interaction in LMMs by updating visual context at each decoding step and introduces a dense instruction dataset for training.

Findings

01. Achieves competitive performance on image and video benchmarks.

02. Outperforms state-of-the-art models in streaming interaction scenarios.

03. Maintains inference efficiency with a new cross-attention architecture.

Abstract

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models,…
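The difference between answering from the frames available when the question is posed and refreshing the visual context at every decoding step can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the names `frames_available`, `decode`, and `describe_latest_frame` are hypothetical, frames are plain strings, the "model" simply echoes the newest frame it can see, and one new frame is assumed to arrive per decoding step.

```python
from typing import Callable, List, Sequence


def frames_available(stream: Sequence[str], t: int) -> List[str]:
    """Return every frame that has arrived by time index t (inclusive)."""
    return list(stream[: t + 1])


def decode(
    stream: Sequence[str],
    question_time: int,
    num_steps: int,
    step_fn: Callable[[List[str], List[str]], str],
    refresh_context: bool,
) -> List[str]:
    """Toy decoding loop (hypothetical, for illustration only).

    If refresh_context is False, every step sees only the frames that
    existed when the question was asked. If True, the visual context is
    rebuilt at each step, assuming one new frame arrives per step.
    """
    tokens: List[str] = []
    static_context = frames_available(stream, question_time)
    for step in range(num_steps):
        if refresh_context:
            context = frames_available(stream, question_time + step)
        else:
            context = static_context
        tokens.append(step_fn(context, tokens))
    return tokens


def describe_latest_frame(context: List[str], tokens: List[str]) -> str:
    """Stand-in for a model step: emit the newest visible frame as a token."""
    return context[-1]


stream = ["cat_sits", "cat_stands", "cat_jumps", "cat_lands"]
static_out = decode(stream, 0, 3, describe_latest_frame, refresh_context=False)
streaming_out = decode(stream, 0, 3, describe_latest_frame, refresh_context=True)
print(static_out)
print(streaming_out)
```

With the static context, every generated token reflects the frame from the moment the question was asked; with the refreshed context, later tokens follow what happens in the video while the answer is still being generated.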

Taxonomy

Topics: Multimedia Communication and Technology · Speech and dialogue systems · Digital Communication and Language