Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Weicai Yan; Yuhong Dai; Qi Ran; Haodong Li; Wang Lin; Hao Liao; Xing Xie; Tao Jin; Jianxun Lian

arXiv:2603.03447 · cs.CV · March 24, 2026

TL;DR

Proact-VL is a framework for real-time, proactive AI companions in gaming scenarios. It targets three problems: low-latency inference on streaming input, autonomous response timing, and control over the quality and quantity of generated content. It is evaluated on a new large-scale dataset, the Live Gaming Benchmark.

Contribution

This work introduces Proact-VL, a general framework for proactive, real-time multimodal AI companions, along with the Live Gaming Benchmark dataset for evaluation.

Findings

01. Proact-VL achieves lower response latency than existing methods.

02. Proact-VL maintains high-quality, human-like interactions in gaming scenarios.

03. The framework demonstrates strong video understanding capabilities in real-time settings.

Abstract

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI