PEARL: Personalized Streaming Video Understanding Model

Yuanhong Zheng; Ruichuan An; Xiaopeng Lin; Yuxing Liu; Sihan Yang; Huanyu Zhang; Haodong Li; Qintong Zhang; Renrui Zhang; Guopeng Li; Yifan Zhang; Yuheng Li; Wentao Zhang

arXiv:2603.20422·cs.CV·March 24, 2026

TL;DR

This paper introduces a new task of Personalized Streaming Video Understanding (PSVU), a benchmark called PEARL-Bench, and a training-free baseline model PEARL, to enable real-time, personalized video comprehension for AI assistants.

Contribution

The paper defines the novel PSVU task, creates PEARL-Bench for evaluation, and proposes PEARL as a robust, training-free baseline to advance personalized streaming video understanding.

Findings

01. PEARL achieves state-of-the-art performance across multiple models.

02. PEARL improves personalization in 3 different architectures.

03. PEARL demonstrates robustness in real-time video understanding.

Abstract

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person…

Code & Models

Datasets

zyh200727/PEARL-Data
dataset · 46 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI