EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

Jiazhou Zhou; Xu Zheng; Yuanhuiyi Lyu; Lin Wang

arXiv:2308.03135 · cs.CV · July 25, 2024 · 5 cites

Open Access

TL;DR

EventBind introduces a unified framework that leverages vision-language models for event-based recognition, effectively addressing modality gaps and data scarcity by aligning image, text, and event representations.

Contribution

The paper presents a novel event encoder, a hybrid text prompt strategy, and a Hierarchical Triple Contrastive Alignment module for improved multi-modal event understanding.
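The idea of aligning three modalities contrastively can be illustrated with a minimal sketch. This is not the paper's Hierarchical Triple Contrastive Alignment module, whose details are not given here; it is a generic pairwise InfoNCE loss summed over the event-image, event-text, and image-text pairs. The function names, the equal pair weights, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every target, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_loss(x, y, temperature=0.07):
    """Average the x-to-y and y-to-x directions, as in CLIP-style training."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def triple_alignment_loss(event, image, text, temperature=0.07):
    """Hypothetical equal-weight sum over the three modality pairs."""
    return (
        symmetric_loss(event, image, temperature)
        + symmetric_loss(event, text, temperature)
        + symmetric_loss(image, text, temperature)
    )
```

In this sketch, row i of each input is assumed to be the embedding of the same sample in each modality, so the loss is small when matching rows point in the same direction and large when they are mismatched.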

Findings

01. Achieves state-of-the-art accuracy on the N-Caltech101 and N-ImageNet benchmarks.

02. Extends to event retrieval tasks with promising results.

03. Demonstrates robustness in fine-tuning and few-shot learning scenarios.
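Once events, images, and text share one embedding space, retrieval reduces to a nearest-neighbor search. The sketch below ranks a gallery of event embeddings by cosine similarity to a query embedding (for example, an encoded text prompt). The function names and toy vectors are hypothetical, and the paper's actual retrieval protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=3):
    """Return indices of the top_k gallery embeddings most similar to query."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(gallery)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]
```

For instance, with a toy gallery `[[1, 0], [0, 1], [0.9, 0.1]]` and query `[1, 0]`, calling `retrieve(query, gallery, top_k=2)` ranks index 0 first and index 2 second.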

Abstract

In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-language models (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial. Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile, generates event prompts for modality bridging. We then design a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training · ALIGN