EventFlash: Towards Efficient MLLMs for Event-Based Vision

Shaoyu Liu; Jianing Li; Guanghui Zhao; Yunjian Zhang; Wen Jiang; Ming Li; Xiangyang Ji

arXiv:2602.03230 · cs.CV · February 4, 2026

Open Access · 3 Reviews

TL;DR

EventFlash introduces a novel, efficient multimodal large language model for event-based vision, leveraging spatiotemporal token sparsification to reduce computational costs while maintaining high performance in processing high-speed and low-light scenarios.

Contribution

The paper presents EventFlash, a new MLLM that employs spatiotemporal token sparsification, an adaptive temporal window, and a density-guided attention mechanism to improve efficiency and scalability.
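To make the idea of spatiotemporal token sparsification concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `sparsify_tokens`, the parameters `n_bins`, `patch`, and `min_events`, and the fixed-threshold rule are all assumptions made for illustration. Events are binned in time, grouped into spatial patches, and only patches with enough events are kept as tokens.

```python
from collections import Counter


def sparsify_tokens(events, width, height, duration,
                    n_bins=4, patch=8, min_events=2):
    """Keep only (time_bin, patch_row, patch_col) cells with enough events.

    Hypothetical illustration: a fixed count threshold stands in for the
    paper's adaptive and density-guided modules, which are not specified here.
    """
    counts = Counter()
    for x, y, t, _polarity in events:
        # Skip events outside the sensor area or the time window.
        if not (0 <= x < width and 0 <= y < height and 0 <= t < duration):
            continue
        time_bin = min(int(t * n_bins / duration), n_bins - 1)
        counts[(time_bin, y // patch, x // patch)] += 1

    # Tokens that survive pruning: cells with at least `min_events` events.
    kept = sorted(key for key, c in counts.items() if c >= min_events)

    # Token count of a dense, image-like grid over the same stream.
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    total = n_bins * rows * cols
    return kept, total


# Toy event stream: (x, y, timestamp in seconds, polarity).
events = [
    (1, 1, 0.00, 1), (2, 1, 0.01, -1), (3, 2, 0.02, 1),  # dense cluster
    (20, 20, 0.55, 1),                                    # isolated event
    (9, 9, 0.80, 1), (10, 9, 0.85, 1),                    # second cluster
]
kept, total = sparsify_tokens(events, width=32, height=32, duration=1.0)
print(f"kept {len(kept)} of {total} candidate tokens: {kept}")
```

On this toy stream a dense grid would produce 64 tokens, while the thresholded version keeps only the 2 cells that contain clusters of events. That gap is the kind of redundancy sparsification targets.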

Findings

1. 12.4x throughput improvement over the baseline
2. Supports long-range event stream processing of up to 1,000 bins
3. Maintains comparable performance with significantly reduced computation

Abstract

Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The creation of the 500k-sample EventMind dataset is a major contribution that addresses a critical resource gap for training and benchmarking event-based MLLMs.
2. The paper targets the correct bottleneck: the inefficiency of applying dense methods to sparse data. The proposed spatiotemporal sparsification modules (ATWA and SDGA) are an intuitive and direct solution to this problem.

Weaknesses

1. Insufficient comparison to SOTA event-based models: The paper fails to benchmark EventFlash against its direct competitors. On its new EventMind dataset, it only compares against frame-based models (Table 1). On the existing EventChat-Sub dataset, it only compares against EventGPT, omitting other SOTA event models like EventVL mentioned in the related work. This makes the SOTA performance claims unsubstantiated.
2. Missing Key Methodological Ablations: The core Adaptive Temporal Window Aggreg

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. A novel self-generated dataset that could be useful for the event-based MLLM research community.
2. Two interesting modules appear to help improve throughput while achieving comparable accuracy.
3. Some qualitative analyses are provided.

Weaknesses

1. The experimental comparison to actual SOTA MLLMs (e.g., Qwen2.5-VL, InternVL, LLaVA-v1.6) is critically undermined by the paper's failure to state whether these baselines were finetuned on event data. This strongly suggests an unfair comparison against zero-shot models, rendering the performance results unreliable.
2. The reliance on "RGB-style" (image-like) representations is a significant limitation. The method's effectiveness is never validated against true native event representations (e.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper presents a clear motivation. It effectively addresses the high temporal resolution and sparsity characteristics of event camera data, significantly reducing data redundancy and improving the efficiency of large language model (LLM) inference.
2. It provides a scene-driven large-scale event dataset, EventMind, which makes a valuable contribution to the advancement of this field.
3. The experimental results are comprehensive, including images, tables, videos, and code. The writing i

Weaknesses

1. The authors emphasize the advantages of their method in terms of efficiency and throughput. However, in Table 1, EventFlash-7B does not show a throughput advantage compared with EventGPT-7B. The authors should provide a discussion to clarify this discrepancy.
2. EventGPT-7B was not evaluated on the EventMind dataset. According to Figure 6, EventGPT-7B appears to work on the EventMind dataset, so why are there no quantitative comparison results presented in Table 1? The authors should explain

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications