Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian; Shuangrui Ding; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Dahua Lin; Jiaqi Wang

arXiv:2501.03218·cs.CV·January 7, 2025

Open Access · 1 Repo · 1 Model

TL;DR

Dispider introduces a novel system for active real-time video interaction by disentangling perception, decision, and reaction processes, enabling timely, accurate, and efficient responses during streaming video analysis.

Contribution

We propose Dispider, a system that separates perception, decision, and reaction to resolve the conflicts among these capabilities and improve real-time video interaction.

Findings

01. Outperforms previous online models in streaming response tasks.
02. Maintains strong performance in conventional video QA.
03. Enables timely and contextually accurate interactions.

Abstract

Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, in which the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, these capabilities inherently conflict. Decision and Reaction require Perception at opposite scales and granularities, and autoregressive decoding blocks real-time Perception and Decision during Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a system that…
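The blocking problem named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `run_stream`, `should_respond`, and `respond` are hypothetical names, the frames are plain strings, and a background thread stands in for slow autoregressive decoding. It only shows the general idea of keeping a perception and decision loop running while a separate reaction worker generates responses.

```python
import queue
import threading


def run_stream(frames, should_respond, respond):
    """Toy decoupling of perception/decision from reaction (hypothetical API)."""
    triggers = queue.Queue()
    responses = []

    def reaction_worker():
        # Reaction: the slow generation step runs off the perception loop.
        while True:
            item = triggers.get()
            if item is None:  # sentinel: the stream has ended
                break
            index, frame = item
            responses.append((index, respond(frame)))

    worker = threading.Thread(target=reaction_worker)
    worker.start()

    seen = []
    for index, frame in enumerate(frames):
        seen.append(frame)              # Perception: keep monitoring the stream
        if should_respond(seen):        # Decision: is this a moment to speak up?
            triggers.put((index, frame))  # hand off without waiting for the reply

    triggers.put(None)
    worker.join()
    return responses


# Assumed stand-ins: respond whenever the observed action changes.
frames = ["walk", "walk", "jump", "jump", "sit"]
responses = run_stream(
    frames,
    should_respond=lambda seen: len(seen) >= 2 and seen[-1] != seen[-2],
    respond=lambda frame: f"The person started to {frame}.",
)
print(responses)
# → [(2, 'The person started to jump.'), (4, 'The person started to sit.')]
```

Because the trigger is placed on a queue, the loop moves on to the next frame immediately; a single worker drains the queue in order, so responses stay aligned with the moments that triggered them.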

Code & Models

Repositories

mark12ding/dispider (PyTorch, official)

Models

Mar2Ding/Dispider (Hugging Face model · 37 downloads · 2 likes)

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Data Stream Mining Techniques