Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Zhiwen Chen; Junhui Hou; Zhiyu Zhu; Jinjian Wu; Guangming Shi

arXiv:2603.03969·cs.CV·March 5, 2026

TL;DR

This paper introduces a self-supervised pretraining approach that leverages visual foundation models to improve dense event-stream representations, addressing challenges of annotation scarcity and domain mismatch.

Contribution

It presents a structure-aware distillation method that aligns image and event representations at the level of semantic structures supplied by VFMs, with the aim of avoiding the semantic collapse that existing distillation schemes show at high resolutions.

Findings

01. Achieves state-of-the-art results on downstream benchmarks.

02. Demonstrates improved generalization and data efficiency.

03. Surpasses traditional and existing pretraining methods.

Abstract

Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and…
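
The abstract is cut off before the objective is specified, so the following is only a generic illustration, not the paper's loss. It contrasts a per-token feature distillation term with a structure-level term that compares pairwise affinity matrices, which is one common way to read "aligning semantic structures". The function names, the cosine affinity, and the toy features are all assumptions made for this sketch.

```python
import math

def l2_normalize(v):
    # Guard against a zero vector so the division is always defined.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_affinity(feats):
    """Pairwise cosine similarity between all token features (N x N)."""
    unit = [l2_normalize(f) for f in feats]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def pointwise_loss(student, teacher):
    """Mean (1 - cosine) between matching student and teacher tokens."""
    total = 0.0
    for s, t in zip(student, teacher):
        s_u, t_u = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_u, t_u))
    return total / len(student)

def structure_loss(student, teacher):
    """Mean squared difference between the two affinity matrices."""
    a_s, a_t = cosine_affinity(student), cosine_affinity(teacher)
    n = len(a_s)
    return sum(
        (a_s[i][j] - a_t[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)

# Hypothetical toy features: three tokens, two channels.
# The teacher (image branch) separates token 2 from tokens 0 and 1.
teacher = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# A "collapsed" student (event branch) maps every token to the same vector.
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

print("pointwise:", pointwise_loss(collapsed, teacher))
print("structure:", structure_loss(collapsed, teacher))
```

In this toy case the collapsed student has an all-ones affinity matrix, so it disagrees with the teacher on four of the nine token pairs and the structure term comes out to about 4/9. The point is only that a structure-level term explicitly penalizes features that have lost the distinctions between regions; how the paper actually defines and weights its objective is not recoverable from the truncated abstract.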

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning