Scaling Dense Event-Stream Pretraining from Visual Foundation Models
Zhiwen Chen, Junhui Hou, Zhiyu Zhu, Jinjian Wu, Guangming Shi

TL;DR
This paper introduces a self-supervised pretraining approach that leverages visual foundation models to improve dense event-stream representations, addressing challenges of annotation scarcity and domain mismatch.
Contribution
It presents a novel structure-aware distillation method that aligns image and event data at the semantic level, significantly enhancing event representation quality.
Findings
Achieves state-of-the-art results on downstream benchmarks.
Demonstrates improved generalization and data efficiency.
Surpasses traditional and existing pretraining methods.
Abstract
Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
