VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen; Zhaoyang Lv; Shiwei Wu; Kevin Qinghong Lin; Chenan Song; Difei Gao; Jia-Wei Liu; Ziteng Gao; Dongxing Mao; Mike Zheng Shou

arXiv:2406.11816·cs.CV·June 18, 2024·3 cites

Open Access · 1 Model · 2 Datasets

TL;DR

This paper introduces VideoLLM-online, a video large language model for temporally aligned, long-context, real-time dialogue over continuous video streams.

Contribution

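It proposes the Learning-In-Video-Stream (LIVE) framework, which combines a training objective for continuous streaming inputs, a data generation scheme that converts offline temporal annotations into streaming dialogue, and an optimized inference pipeline, and it reports more than 10 FPS on 5-minute videos.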

Findings

01 Supports over 10 FPS on 5-minute videos

02 Achieves state-of-the-art results on offline benchmarks

03 Enables long-context, real-time video conversations

Abstract

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the…
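To make item (2) of the abstract more concrete, the following is a minimal sketch of how offline temporal annotations might be laid onto a frame timeline as a streaming dialogue. It is not the paper's actual data generation scheme: the `Annotation` and `StreamStep` types, the `to_streaming_dialogue` function, the default sampling rate, and the rule of responding at the first sampled frame at or after an annotation's start time are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical offline annotation: a timed caption for a video segment."""
    start: float  # seconds, when the annotated event begins
    end: float    # seconds, when the annotated event ends
    text: str     # caption or narration attached to the segment


@dataclass
class StreamStep:
    """Hypothetical streaming step: one sampled frame plus an optional reply."""
    time: float              # timestamp of the sampled frame, in seconds
    response: Optional[str]  # text to emit at this frame, or None to stay silent


def to_streaming_dialogue(
    annotations: List[Annotation], duration: float, fps: float = 2.0
) -> List[StreamStep]:
    """Place each offline annotation on a per-frame timeline (assumed format)."""
    num_frames = int(duration * fps)
    steps = [StreamStep(time=i / fps, response=None) for i in range(num_frames)]
    for ann in sorted(annotations, key=lambda a: a.start):
        # Assumption: respond at the first sampled frame at or after the start time.
        index = math.ceil(ann.start * fps)
        # If two annotations map to the same frame, keep the earlier one.
        if index < num_frames and steps[index].response is None:
            steps[index].response = ann.text
    return steps


if __name__ == "__main__":
    demo = [Annotation(1.0, 2.0, "pick up cup"), Annotation(2.5, 3.5, "pour water")]
    for step in to_streaming_dialogue(demo, duration=4.0, fps=2.0):
        print(step.time, step.response or "<silent>")
```

In this sketch most frames carry no response, which mirrors the idea that a streaming model must decide at every frame whether to speak or stay silent; how the paper actually encodes that decision is not stated in the text above.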

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Models

🤗 chenjoya/videollm-online-8b-v1plus
model · 8.4k downloads · ♡ 30

Datasets

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training