Segment Any Events with Language

Seungjun Lee; Gim Hee Lee

arXiv:2601.23159·cs.CV·February 2, 2026

TL;DR

SEAL is a novel framework for open-vocabulary event instance segmentation that supports multi-level granularity and outperforms baselines in speed and accuracy, advancing scene understanding with event sensors.

Contribution

Introduces SEAL, the first semantic-aware framework for open-vocabulary event segmentation supporting multiple granularity levels and extensive benchmark evaluation.

Findings

01

SEAL outperforms baselines in accuracy and inference speed.

02

Four curated benchmarks cover label granularity from coarse to fine and semantic granularity from instance-level to part-level.

03

A variant of SEAL achieves spatiotemporal segmentation without visual prompts.

Abstract

Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms…
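The abstract does not spell out how open-vocabulary mask classification is computed. A common approach in open-vocabulary segmentation is to compare an embedding for each mask against text embeddings of the candidate class names by cosine similarity. The sketch below illustrates only that generic idea and is not SEAL's actual pipeline; the function names, the label set, and the toy 3-D vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_masks(mask_embeddings, text_embeddings, labels):
    """Give each mask the label whose text embedding is most cosine-similar."""
    text_unit = [l2_normalize(t) for t in text_embeddings]
    predictions = []
    for emb in mask_embeddings:
        unit = l2_normalize(emb)
        # Dot product of unit vectors equals cosine similarity.
        scores = [sum(a * b for a, b in zip(unit, t)) for t in text_unit]
        best = max(range(len(labels)), key=lambda i: scores[i])
        predictions.append((labels[best], scores[best]))
    return predictions


# Hand-made toy embeddings (hypothetical values, not model outputs).
labels = ["car", "pedestrian", "wheel"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

predictions = classify_masks(mask_embeddings, text_embeddings, labels)
for label, score in predictions:
    print(f"{label}: {score:.3f}")
```

Because the label set is just a list of strings turned into embeddings, the same routine can in principle be queried with instance-level names ("car") or part-level names ("wheel") without retraining a fixed classifier head, which is the property that makes the vocabulary open.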

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Strong performance against baselines

Weaknesses

- Limited novelty: the method is essentially a merger of several pretrained foundation models and their data in a clever way.
- Limited ablations: qualitative ablations on the Spatial Encoding or Mask Feature Enhancer are missing or not well explained; I did not understand Tables 4 and 5.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This work is shown to outperform MaskCLIP and other leading open-vocabulary baselines by 3.4 AP on DSEC11-Ins and 3.2 AP on DDD17-Ins. Inference is 5-18x faster with fewer than one fifth of the parameters, which is a good contribution towards practical application. Second, unlike prior methods, this work's two-stage mask feature enhancement and spatial encoding overcome the "dead mask" issue, where small-event region masks are mapped to zero vectors; UMAP visualisations show tight semantic separation

Weaknesses

* DSEC19-Ins is a highly fine-grained dataset: on it, the improvement over MaskCLIP narrows to just 0.7 AP. It seems to me that this suggests, even with annotation-free training, that the distilled representations are less robust when class granularity exceeds the capacity of available MHSG cues.
* I think that the main variant of the method benefits from GT-derived visual prompts for mask proposals; although a supplementary "prompt-free" variant exists, the claim of real-world flexibility is l

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* This paper introduces a "Segment Any Events" framework that can generate open-world semantic predictions for event masks.
* This paper attempts to address OV-EIS with support for free-form language queries.
* This paper proposes four benchmarks for evaluation.

Weaknesses

* The architecture seems rather complex. Could the authors provide a clearer motivation for the inclusion of these modules? Are all of these modules necessary? Is there a simpler approach that could achieve the same results? Besides, it appears that the method is transferring concepts from open-vocabulary image segmentation techniques, such as OpenSeg, MaskCLIP, MaskCLIP++, and OVSeg to the event modality. Could the authors clarify this adaptation and its justification?
* The paper claims to ha

Code & Models

Datasets

onandon/SEAL
dataset · 27k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition