Maximizing Asynchronicity in Event-based Neural Networks

Haiqing Hao; Nikola Zubić; Weihua He; Zhipeng Sui; Davide Scaramuzza; Wenhui Wang

arXiv:2505.11165 · cs.LG · March 9, 2026

Open Access 3 Reviews

TL;DR

This paper introduces EVA, a novel asynchronous feature learning framework for event cameras that enhances expressivity and generalizability, leading to superior recognition and detection performance in real-time vision tasks.

Contribution

EVA adapts two advances from language modeling, linear attention and self-supervised learning, to event-based neural networks, so that features can be produced event by event from asynchronous event data.

Findings

01. EVA outperforms prior A2S methods on recognition tasks.
02. EVA achieves 0.477 mAP on the Gen1 detection dataset.
03. First A2S framework to excel in demanding detection tasks.

Abstract

Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned features for ML pipelines, existing A2S approaches often sacrifice expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous feature learning), a novel A2S framework to generate highly expressive and generalizable event-by-event features. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks…
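The abstract's pairing of linear attention with event-by-event features can be illustrated with a minimal sketch. This is not EVA's architecture: it is generic causal linear attention written in recurrent form, and the feature map `elu(x) + 1`, the class name `LinearAttentionState`, and the helper `encode_stream` are all assumptions made for illustration. The point it shows is that each incoming token (here standing in for an event) updates a fixed-size state and immediately yields an output, without revisiting earlier tokens.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Hypothetical recurrent state for causal linear attention.

    S accumulates phi(k) v^T and z accumulates phi(k), so memory stays
    O(d_k * d_v) no matter how many tokens have been processed.
    """

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def step(self, q, k, v, eps=1e-6):
        """Consume one token and return its attention output."""
        phi_q, phi_k = feature_map(q), feature_map(k)
        for i, ki in enumerate(phi_k):
            self.z[i] += ki
            for j, vj in enumerate(v):
                self.S[i][j] += ki * vj
        denom = sum(qi * zi for qi, zi in zip(phi_q, self.z)) + eps
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(len(v))
        ]


def encode_stream(qs, ks, vs):
    """Process tokens one at a time, as an asynchronous encoder would."""
    state = LinearAttentionState(len(ks[0]), len(vs[0]))
    return [state.step(q, k, v) for q, k, v in zip(qs, ks, vs)]


def causal_reference(qs, ks, vs, eps=1e-6):
    """Quadratic-cost reference: explicit causal sum over all past tokens."""
    outs = []
    for t, q in enumerate(qs):
        phi_q = feature_map(q)
        w = [
            sum(a * b for a, b in zip(phi_q, feature_map(ks[s])))
            for s in range(t + 1)
        ]
        denom = sum(w) + eps
        outs.append(
            [sum(ws * vs[s][j] for s, ws in enumerate(w)) / denom
             for j in range(len(vs[0]))]
        )
    return outs
```

Under these assumptions, `encode_stream` and `causal_reference` agree up to floating-point rounding, while only the recurrent form has constant per-token cost. That constant cost is the property that makes linear attention a natural fit for per-event processing.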

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* Tackles a timely challenge, as it is yet unclear how to process event camera inputs
* Interesting ideas to use a token-based approach with a linear transformer
* The self-supervised aspect with the multi-representation training is seemingly novel and interesting

Weaknesses

Overall, my major concern with this work is that the results are not very compelling. The authors evaluate only two tasks (object detection and recognition). Given that the authors are proposing a new architecture, it would be good to have at least one more, slightly different task, because the robustness of their architecture is currently not clear. The proposed approach does not achieve the best accuracy or latency, but is another tradeoff point in the space. If the key benefits ar

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The proposed framework is well-structured and systematically designed. The overall pipeline is coherent, covering event tokenization, temporal-difference encoding, MVHS feature generation, and self-supervised tasks in an end-to-end manner. The experimental evaluation is comprehensive; the paper validates the generality of EVA across several datasets and tasks, includes meaningful ablation studies, and provides detailed experimental configurations. The conceptual alignment between event represe

Weaknesses

In the supplementary material (Fig. 6(c)), it is unclear whether the visualized features correspond to the MVHS outputs. Section E. Visualization lacks a clear description or interpretation of these results. While performance metrics on N-Cars and Gen1 are reported, latency, throughput, and scalability analyses are missing. Since A2S methods can, in principle, operate at arbitrary event sampling frequencies, it is important to analyze how performance and efficiency vary with sequence length.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The paper draws an insightful analogy between event streams and language sequences, introducing a linear attention mechanism and self-supervised learning to propose a novel and efficient paradigm for asynchronous event processing.
2. The authors conduct extensive evaluations on multiple public datasets (DVS128-Gesture, N-Cars, and Gen1), covering both recognition and detection tasks. The results are convincing and demonstrate clear advantages over various baseline methods.
3. The manuscript

Weaknesses

1. As an A2S framework, a key question is how to achieve better asynchronous event representation. While I acknowledge that the authors have already included sufficient baseline comparisons, I wonder whether they have considered fixing the task-specific backbone and varying only the event representations, as described in Section 2.1. For example, in the object detection task on Gen1, one could fix a backbone (e.g., RVT-B) and compare different existing representation methods (e.g., time-surface,

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing

Methods: Softmax · Attention Is All You Need