EventVL: Understand Event Streams via Multimodal Large Language Model

Pengteng Li; Yunfan Lu; Pinghao Song; Wuyang Li; Huizai Yao; Hui Xiong

arXiv:2501.13707·cs.CV·September 24, 2025

Open Access

TL;DR

EventVL is a multimodal large language model designed for explicit semantic understanding of event streams, leveraging a large annotated dataset and novel representation techniques to outperform existing models in event captioning and scene description.

Contribution

We introduce EventVL, the first generative event-based multimodal LLM with a large annotated dataset and new semantic representation methods for improved event understanding.

Findings

01

EventVL significantly outperforms existing MLLMs in event captioning.

02

The annotated dataset contains 1.4 million high-quality event-image/video-text pairs.

03

Proposed methods enhance semantic understanding and scene description capabilities.

Abstract

Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works merely use CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding the rich semantics and context in event streams. To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality data pairs, which enables effective learning across various scenes, e.g., driving scenes or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and…

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Text Analysis Techniques

Methods: Contrastive Language-Image Pre-training