Generative Event Pretraining with Foundation Model Alignment

Jianwen Cao; Jiaxu Xing; Nico Messikommer; Davide Scaramuzza

arXiv:2603.23032·cs.CV·April 6, 2026

TL;DR

GEP is a two-stage framework that transfers semantic knowledge from image datasets to event data, enabling event-based visual foundation models that transfer to tasks such as object recognition, segmentation, and depth estimation.

Contribution

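GEP combines two steps: it first aligns an event encoder to a frozen image VFM, then autoregressively pretrains a transformer backbone on mixed event-image sequences. Together these improve feature transferability and temporal understanding for event data.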

Findings

01 Outperforms state-of-the-art event pretraining methods.

02 Achieves strong results on object recognition, segmentation, and depth estimation.

03 Produces a semantically rich, temporally aware event model.

Abstract

Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to…
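
To make the first stage more concrete, the following is a minimal sketch of what a joint regression-contrastive alignment loss could look like. The abstract does not specify the exact formulation, so the choice of mean squared error for the regression term, InfoNCE for the contrastive term, and every name and parameter here (joint_alignment_loss, temperature, contrastive_weight) are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def joint_alignment_loss(event_feats, image_feats,
                         temperature=0.1, contrastive_weight=1.0):
    """Hypothetical joint loss: MSE regression plus InfoNCE contrastive term.

    event_feats[i] and image_feats[i] are assumed to describe the same scene.
    image_feats stand in for the frozen VFM's outputs and act as fixed targets.
    Returns (total, regression_term, contrastive_term).
    """
    n = len(event_feats)
    dim = len(event_feats[0])

    # Regression term: mean squared error between paired feature vectors.
    mse = sum(
        (e - t) ** 2
        for ev, im in zip(event_feats, image_feats)
        for e, t in zip(ev, im)
    ) / (n * dim)

    # Contrastive term: each event feature should be more similar to its own
    # image feature than to the other image features in the batch.
    ev_n = [l2_normalize(v) for v in event_feats]
    im_n = [l2_normalize(v) for v in image_feats]
    nce = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(ev_n[i], im_n[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        nce += log_denom - logits[i]
    nce /= n

    return mse + contrastive_weight * nce, mse, nce


if __name__ == "__main__":
    targets = [[1.0, 0.0], [0.0, 1.0]]   # stand-in frozen VFM features
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # event features matching their pairs
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # event features matching the wrong pair
    print(joint_alignment_loss(aligned, targets))
    print(joint_alignment_loss(swapped, targets))
```

In this sketch the regression term pulls each event feature toward its paired image feature, while the contrastive term separates it from the other image features in the batch. In a real training setup only the event encoder would receive gradients, since the VFM is frozen.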
