Segment Any Events with Language
Seungjun Lee, Gim Hee Lee

TL;DR
SEAL is a novel framework for open-vocabulary event instance segmentation that supports multi-level granularity and outperforms baselines in speed and accuracy, advancing scene understanding with event sensors.
Contribution
Introduces SEAL, the first semantic-aware framework for open-vocabulary event segmentation supporting multiple granularity levels and extensive benchmark evaluation.
Findings
SEAL outperforms baselines in accuracy and inference speed.
Curated four benchmarks covering various label granularities.
A variant of SEAL achieves spatiotemporal segmentation without visual prompts.
Abstract
Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms…
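As a rough illustration of the open-vocabulary mask classification step the abstract mentions, the sketch below assigns each mask the label whose text embedding is closest in cosine similarity. This is a common pattern in open-vocabulary segmentation and is not confirmed to be SEAL's exact procedure. The function names (`cosine`, `classify_masks`) and the toy three-dimensional embeddings are hypothetical, and a real system would obtain both embeddings from pretrained encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def classify_masks(mask_embeddings, text_embeddings):
    """Assign each mask the label whose text embedding is most similar."""
    labels = []
    for emb in mask_embeddings:
        scores = {name: cosine(emb, t) for name, t in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Hypothetical toy embeddings; the vocabulary mixes instance-level
# ("car", "pedestrian") and part-level ("wheel") labels.
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "wheel": [0.7, 0.0, 0.7],
}
mask_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.5, 0.0, 0.6],
]
predicted = classify_masks(mask_embeddings, text_embeddings)
print(predicted)
```

Because the label set is just a dictionary of text embeddings, changing the vocabulary (coarse to fine classes, or instance-level to part-level labels) needs no retraining in this sketch.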
Peer Reviews
Decision·ICLR 2026 Poster
Strong performance against baselines
- Limited novelty: the method is essentially a merger of several pretrained foundation models and their data in a clever way.
- Limited ablations: qualitative ablations on the Spatial Encoding or Mask Feature Enhancer are missing or not well explained; I did not understand Tables 4 and 5.
This work outperforms MaskCLIP and other leading open-vocabulary baselines by 3.4 AP on DSEC11-Ins and 3.2 AP on DDD17-Ins. Inference is 5-18x faster with fewer than one fifth of the parameters, which is a good contribution towards practical application. Second, unlike prior methods, this work's two-stage mask feature enhancement and spatial encoding overcome the "dead mask" issue, where small-event region masks are mapped to zero vectors; UMAP visualisations show tight semantic separation
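The "dead mask" issue named in this review can be illustrated with a toy example. The sketch below assumes one plausible mechanism: a mask is downsampled to the feature-map resolution before mask pooling, so a very small region can vanish and its pooled feature collapses to a zero vector. The review does not state that this is the exact cause in prior methods, and `downsample_mask`, `mask_pool`, and the tiny feature map are hypothetical.

```python
def downsample_mask(mask, stride):
    """Nearest-neighbour downsampling: keep every `stride`-th pixel."""
    return [row[::stride] for row in mask[::stride]]

def mask_pool(features, mask):
    """Average the feature vectors where mask == 1.

    Returns a zero vector when the mask is empty, which is the
    "dead mask" case in this sketch.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for r, row in enumerate(mask):
        for c, on in enumerate(row):
            if on:
                count += 1
                for k in range(dim):
                    total[k] += features[r][c][k]
    if count == 0:
        return [0.0] * dim
    return [v / count for v in total]

# Hypothetical 2x2 feature map (stride 4 relative to an 8x8 input),
# with 2-dimensional features per location.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

large_mask = [[1] * 8 for _ in range(8)]   # covers the whole input
small_mask = [[0] * 8 for _ in range(8)]   # a single-pixel region
small_mask[1][2] = 1

large_vec = mask_pool(features, downsample_mask(large_mask, 4))
small_vec = mask_pool(features, downsample_mask(small_mask, 4))
print(large_vec)  # [4.0, 5.0]
print(small_vec)  # [0.0, 0.0]
```

A zero vector carries no semantic signal, so any similarity-based classifier applied to it cannot tell classes apart. According to the review, SEAL's mask feature enhancement and spatial encoding are designed to avoid this collapse.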
* DSEC19-Ins is a highly fine-grained dataset: on it, the improvement over MaskCLIP narrows to just 0.7 AP. This suggests to me that, even with annotation-free training, the distilled representations are less robust when class granularity exceeds the capacity of available MHSG cues.
* I think that the main variant of the method benefits from GT-derived visual prompts for mask proposals; although a supplementary "prompt-free" variant exists, the claim of real-world flexibility is l
* This paper introduces a "Segment Any Events" framework, which can generate open-world semantic predictions for event masks.
* This paper attempts to address OV-EIS with support for free-form language queries.
* This paper proposes four benchmarks for evaluation.
* The architecture seems rather complex. Could the authors provide a clearer motivation for the inclusion of these modules? Are all of these modules necessary? Is there a simpler approach that could achieve the same results? Besides, it appears that the method is transferring concepts from open-vocabulary image segmentation techniques, such as OpenSeg, MaskCLIP, MaskCLIP++, and OVSeg, to the event modality. Could the authors clarify this adaptation and its justification?
* The paper claims to ha
Code & Models
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
