Maximizing Asynchronicity in Event-based Neural Networks
Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza, Wenhui Wang

TL;DR
This paper introduces EVA, a novel asynchronous feature learning framework for event cameras that enhances expressivity and generalizability, leading to superior recognition and detection performance in real-time vision tasks.
Contribution
EVA uniquely adapts language modeling techniques to event-based neural networks, significantly improving their ability to handle asynchronous event data.
Findings
EVA outperforms prior A2S methods on recognition tasks.
EVA achieves 0.477 mAP on the Gen1 detection dataset.
First A2S framework to excel in demanding detection tasks.
Abstract
Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned features for ML pipelines, existing A2S approaches often sacrifice expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous feature learning), a novel A2S framework to generate highly expressive and generalizable event-by-event features. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks…
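The abstract names linear attention as one of the language-modeling ingredients EVA adapts. The following is a minimal, generic sketch of causal linear attention in its recurrent form, included only to illustrate why such a mechanism suits event-by-event processing: each new token updates a fixed-size state, so the per-event cost does not grow with the number of past events. It is not EVA's actual architecture. The `elu(x) + 1` feature map, the class name `LinearAttentionState`, the helper `causal_reference`, and all dimensions are assumptions made for illustration.

```python
import math
import random


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class LinearAttentionState:
    """Hypothetical recurrent state for causal linear attention.

    S accumulates outer products phi(k) v^T (shape d_k x d_v).
    z accumulates phi(k) (length d_k) for normalization.
    """

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def step(self, q, k, v):
        """Consume one token (e.g. one event embedding) and return its output."""
        fk = phi(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            row = self.S[i]
            for j, vj in enumerate(v):
                row[j] += fki * vj
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(len(v))
        ]


def causal_reference(qs, ks, vs):
    """Quadratic-cost reference: attend explicitly over all past tokens."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [
            sum(a * b for a, b in zip(fq, phi(ks[s]))) for s in range(t + 1)
        ]
        total = sum(weights)
        outs.append(
            [
                sum(w * vs[s][j] for s, w in enumerate(weights)) / total
                for j in range(len(vs[0]))
            ]
        )
    return outs


if __name__ == "__main__":
    random.seed(0)
    T, d_k, d_v = 5, 4, 3  # illustrative sizes only
    qs = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
    ks = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
    vs = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
    state = LinearAttentionState(d_k, d_v)
    for q, k, v in zip(qs, ks, vs):
        print(state.step(q, k, v))
```

In this sketch the recurrent update and the explicit causal sum give the same outputs; the difference is that the recurrent form keeps only `S` and `z` between tokens, which is what makes asynchronous, one-event-at-a-time inference possible in principle.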
Peer Reviews
Decision: ICLR 2026 Poster
* Tackles a timely challenge, as it is not yet clear how to process event camera inputs
* Interesting idea to use a token-based approach with a linear transformer
* The self-supervised aspect with multi-representation training is seemingly novel and interesting
Overall, my major concern with this work is that the results are not very compelling. The authors evaluate only two tasks (object detection and recognition). Given that the authors are proposing a new architecture, it would be good to have at least one other, slightly different task; as it stands, the robustness of their architecture is not clear. The proposed approach does not achieve the best accuracy or latency, but is another tradeoff point in the space. If the key benefits ar
The proposed framework is well-structured and systematically designed. The overall pipeline is coherent, covering event tokenization, temporal-difference encoding, MVHS feature generation, and self-supervised tasks in an end-to-end manner. The experimental evaluation is comprehensive; the paper validates the generality of EVA across several datasets and tasks, includes meaningful ablation studies, and provides detailed experimental configurations. The conceptual alignment between event represe
In the supplementary material (Fig. 6(c)), it is unclear whether the visualized features correspond to the MVHS outputs. Section E (Visualization) lacks a clear description or interpretation of these results. While performance metrics on N-Cars and Gen1 are reported, latency, throughput, and scalability analyses are missing. Since A2S methods can, in principle, operate at arbitrary event sampling frequencies, it is important to analyze how performance and efficiency vary with sequence length.
1. The paper draws an insightful analogy between event streams and language sequences, introducing a linear attention mechanism and self-supervised learning to propose a novel and efficient paradigm for asynchronous event processing.
2. The authors conduct extensive evaluations on multiple public datasets (DVS128-Gesture, N-Cars, and Gen1), covering both recognition and detection tasks. The results are convincing and demonstrate clear advantages over various baseline methods.
3. The manuscript
1. As an A2S framework, a key question is how to achieve better asynchronous event representation. While I acknowledge that the authors have already included sufficient baseline comparisons, I wonder whether they have considered fixing the task-specific backbone and varying only the event representations, as described in Section 2.1. For example, in the object detection task on Gen1, one could fix a backbone (e.g., RVT-B) and compare different existing representation methods (e.g., time-surface,
Taxonomy
Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing
Methods: Softmax · Attention Is All You Need
