StreamChat: Chatting with Streaming Video
Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez

TL;DR
StreamChat introduces a dynamic, real-time visual context updating mechanism for Large Multimodal Models, significantly improving streaming video interaction capabilities with a novel architecture and dataset.
Contribution
It proposes a new approach for streaming video interaction in LMMs by updating visual context at each decoding step and introduces a dense instruction dataset for training.
Findings
Achieves competitive performance on image and video benchmarks.
Outperforms state-of-the-art models in streaming interaction scenarios.
Maintains inference efficiency with a new cross-attention architecture.
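To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy. It is not the paper's architecture: the function name, the projection matrices, and all shapes are illustrative assumptions. It only shows the general pattern in which text hidden states form the queries and visual tokens supply the keys and values, so the visual tokens never have to be appended to the text sequence.

```python
import numpy as np


def cross_attention(text_states, visual_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend to visual keys/values.

    Hypothetical helper for illustration only, not StreamChat's implementation.
    """
    q = text_states @ w_q                          # (T, d) queries from text
    k = visual_tokens @ w_k                        # (V, d) keys from video
    v = visual_tokens @ w_v                        # (V, d) values from video
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (T, V) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over V
    return weights @ v, weights


# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
d_model = 8
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
text_states = rng.standard_normal((3, d_model))    # 3 text positions
visual_tokens = rng.standard_normal((5, d_model))  # 5 visual tokens

out, weights = cross_attention(text_states, visual_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the visual tokens enter only as keys and values, swapping in a newer set of visual tokens changes the attention inputs without altering the text sequence itself, which is one plausible reason such a design suits streaming inputs.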
Abstract
This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models,…
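The difference between freezing the visual context at question time and refreshing it at every decoding step can be illustrated with a toy loop. This is a sketch under stated assumptions, not the authors' implementation: `decode_with_streaming_context`, `frame_at`, `next_token`, and the fixed `seconds_per_token` rate are all hypothetical stand-ins for a real video stream and a real model.

```python
from typing import Callable, List


def decode_with_streaming_context(
    frame_at: Callable[[float], str],
    next_token: Callable[[List[str], str], str],
    start_time: float,
    seconds_per_token: float,
    max_tokens: int,
    refresh_each_step: bool = True,
) -> List[str]:
    """Greedy decoding loop with an optionally refreshed visual context.

    Hypothetical helper: assumes each decoding step takes a fixed wall-clock
    time, so step i sees the stream at start_time + i * seconds_per_token.
    """
    tokens: List[str] = []
    context = frame_at(start_time)  # context available when the question is asked
    for step in range(max_tokens):
        if refresh_each_step:
            # Re-read the stream so this step sees the latest frame.
            context = frame_at(start_time + step * seconds_per_token)
        token = next_token(tokens, context)
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens


# Toy stream: the scene changes from "cat" to "dog" at t = 1.0 s.
def frame_at(t: float) -> str:
    return "cat" if t < 1.0 else "dog"


# Toy "model": simply names whatever is in the current visual context.
def next_token(prefix: List[str], context: str) -> str:
    return context


static = decode_with_streaming_context(
    frame_at, next_token, 0.0, 0.5, 4, refresh_each_step=False
)
streaming = decode_with_streaming_context(frame_at, next_token, 0.0, 0.5, 4)
print(static)     # ['cat', 'cat', 'cat', 'cat']
print(streaming)  # ['cat', 'cat', 'dog', 'dog']
```

With the frozen context, the toy model keeps describing the scene as it was when the question arrived; with per-step refresh, its output follows the scene change midway through the answer.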
Taxonomy
Topics: Multimedia Communication and Technology · Speech and dialogue systems · Digital Communication and Language
