Generative Event Pretraining with Foundation Model Alignment
Jianwen Cao, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza

TL;DR
GEP is a two-stage framework that transfers semantic knowledge from image datasets to event data, yielding an event-based visual foundation model that can be applied to object recognition, segmentation, and depth estimation.
Contribution
GEP first aligns an event encoder to a frozen image VFM, then autoregressively pretrains a transformer backbone on mixed event-image sequences, with the aim of improving transferability and temporal understanding.
Findings
Outperforms state-of-the-art event pretraining methods.
Achieves strong results on object recognition, segmentation, and depth estimation.
Produces a semantically rich, temporally aware event model.
Abstract
Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to…
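The first stage described above combines a regression term with a contrastive term to pull event features toward those of a frozen VFM. The abstract does not give the exact formulation, so the following is only a minimal sketch under stated assumptions: mean squared error as the regression term, a one-directional InfoNCE loss as the contrastive term, and a weighted sum of the two. The function names, the weights `reg_weight` and `con_weight`, and the `temperature` value are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def regression_loss(event_feats, image_feats):
    # Assumed regression term: mean squared error over all feature elements.
    total, count = 0.0, 0
    for e, t in zip(event_feats, image_feats):
        for x, y in zip(e, t):
            total += (x - y) ** 2
            count += 1
    return total / count


def contrastive_loss(event_feats, image_feats, temperature=0.07):
    # Assumed contrastive term: InfoNCE from event features to image features.
    # The i-th event feature should match the i-th image feature; all other
    # image features in the batch act as negatives.
    e = [l2_normalize(v) for v in event_feats]
    t = [l2_normalize(v) for v in image_feats]
    loss = 0.0
    for i, ei in enumerate(e):
        logits = [dot(ei, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(e)


def alignment_loss(event_feats, image_feats,
                   reg_weight=1.0, con_weight=1.0, temperature=0.07):
    # Hypothetical joint objective: weighted sum of the two terms.
    # image_feats stand in for outputs of the frozen VFM (treated as constants).
    return (reg_weight * regression_loss(event_feats, image_feats)
            + con_weight * contrastive_loss(event_feats, image_feats, temperature))


# Toy batch: two samples with 2-D features.
frozen_image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_event_feats = [[1.0, 0.0], [0.0, 1.0]]
misaligned_event_feats = [[0.0, 1.0], [1.0, 0.0]]

print(alignment_loss(aligned_event_feats, frozen_image_feats))
print(alignment_loss(misaligned_event_feats, frozen_image_feats))
```

In this toy batch the correctly paired features give a loss near zero, while the swapped pairing is penalized by both terms. In an actual training setup only the event encoder would receive gradients, since the VFM is frozen.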
