EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang

TL;DR
EventBind introduces a unified framework that leverages vision-language models for event-based recognition, effectively addressing modality gaps and data scarcity by aligning image, text, and event representations.
Contribution
The paper presents a novel event encoder, a hybrid text prompt strategy, and a Hierarchical Triple Contrastive Alignment module for improved multi-modal event understanding.
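To make the idea of aligning three modalities concrete, the sketch below averages a symmetric InfoNCE loss (the CLIP-style contrastive objective) over the event-image, event-text, and image-text pairs. It is a generic illustration only: the function names, the equal weighting of the three pairs, and the temperature of 0.07 are assumptions, and it does not reproduce the paper's Hierarchical Triple Contrastive Alignment module, whose details are not given here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: a[i] and b[i] form the positive pair for sample i."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    # Cosine similarity between every a[i] and every b[j], scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        for va in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _mean_diagonal_cross_entropy(logits)
        + _mean_diagonal_cross_entropy(transposed)
    )


def triple_contrastive_loss(event, image, text, temperature=0.07):
    """Hypothetical equal-weight average over the three modality pairs."""
    return (
        info_nce(event, image, temperature)
        + info_nce(event, text, temperature)
        + info_nce(image, text, temperature)
    ) / 3.0


if __name__ == "__main__":
    # Toy 2-sample batch with 2-D embeddings (made-up values for illustration).
    event = [[1.0, 0.0], [0.0, 1.0]]
    image = [[0.9, 0.1], [0.1, 0.9]]
    text = [[1.0, 0.2], [0.2, 1.0]]
    print(triple_contrastive_loss(event, image, text))
```

The loss is small when matching samples across modalities point in the same direction and grows when they are mismatched, which is the behaviour a shared embedding space for images, text, and events relies on.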
Findings
Achieves state-of-the-art accuracy on N-Caltech101 and N-Imagenet benchmarks.
Effectively extends to event retrieval tasks with promising results.
Demonstrates robustness in fine-tuning and few-shot learning scenarios.
Abstract
In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events and, meanwhile, generates event prompts for modality bridging. We then design a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · ALIGN
