EventVL: Understand Event Streams via Multimodal Large Language Model
Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong

TL;DR
EventVL is a multimodal large language model designed for explicit semantic understanding of event streams, leveraging a large annotated dataset and novel representation techniques to outperform existing models in event captioning and scene description.
Contribution
We introduce EventVL, the first generative event-based multimodal LLM with a large annotated dataset and new semantic representation methods for improved event understanding.
Findings
EventVL significantly outperforms existing MLLMs in event captioning.
The annotated dataset contains almost 1.4 million high-quality event-image/video-text pairs.
Proposed methods enhance semantic understanding and scene description capabilities.
Abstract
Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works simply use CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding the rich semantics and context of event streams. To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality data pairs, which enables effective learning across various scenes, e.g., driving scenes or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and…
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Text Analysis Techniques
Methods: Contrastive Language-Image Pre-training
