EBES: Easy Benchmarking for Event Sequences
Dmitry Osin, Igor Udovichenko, Viktor Moskvoretskii, Egor Shvetsov, Evgeny Burnaev

TL;DR
EBES provides a standardized benchmarking framework and open-source tools for event sequence classification, addressing the lack of evaluation protocols and enabling reliable comparison of models across diverse datasets.
Contribution
This paper introduces EBES, a comprehensive benchmark for event sequence (EvS) classification. It comprises standardized evaluation protocols, a PyTorch library implementing 9 models, and the largest collection of EvS datasets, with the aim of improving research consistency.
Findings
GRU-based models perform best among tested models
EvS classification presents unique challenges compared to other sequential data
Benchmarking results facilitate model comparison and reveal robustness issues
Abstract
Event Sequences (EvS) refer to sequential data characterized by irregular sampling intervals and a mix of categorical and numerical features. Accurate classification of these sequences is crucial for various real-life applications, including healthcare, finance, and user interaction. Despite the popularity of the EvS classification task, there is currently no standardized benchmark or rigorous evaluation protocol. This lack of standardization makes it difficult to compare results across studies, which can result in unreliable conclusions and hinder progress in the field. To address this gap, we present EBES, a comprehensive benchmark for EvS classification with sequence-level targets. EBES features standardized evaluation scenarios and protocols, along with an open-source PyTorch library that implements 9 modern models. Additionally, it includes the largest collection of EvS datasets,…
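To make the definition above concrete, here is a minimal sketch of how a single event sequence with a sequence-level target might be represented. The `EventSequence` class and the feature names `merchant_type` and `amount` are hypothetical illustrations, not part of the EBES library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EventSequence:
    """Hypothetical container for one event sequence (not the EBES API)."""

    timestamps: List[float]            # irregularly spaced event times
    categorical: Dict[str, List[int]]  # feature name -> one integer code per event
    numerical: Dict[str, List[float]]  # feature name -> one value per event
    target: int                        # sequence-level label

    def __post_init__(self) -> None:
        n = len(self.timestamps)
        # Every feature must provide exactly one value per event.
        for name, values in {**self.categorical, **self.numerical}.items():
            if len(values) != n:
                raise ValueError(
                    f"feature {name!r} has {len(values)} values, expected {n}"
                )
        # Events must be ordered in time.
        if any(b < a for a, b in zip(self.timestamps, self.timestamps[1:])):
            raise ValueError("timestamps must be non-decreasing")

    def time_deltas(self) -> List[float]:
        """Gaps between consecutive events; the first event gets 0.0."""
        gaps = [b - a for a, b in zip(self.timestamps, self.timestamps[1:])]
        return [0.0] + gaps


# A made-up transaction-like sequence with uneven gaps between events.
seq = EventSequence(
    timestamps=[0.0, 0.5, 3.0, 3.25],
    categorical={"merchant_type": [2, 2, 7, 1]},
    numerical={"amount": [12.5, 3.0, 250.0, 8.75]},
    target=1,
)
deltas = seq.time_deltas()
```

Unlike a regularly sampled time series, the gaps in `deltas` vary from event to event, so a model typically needs the timing information as an explicit input alongside the categorical and numerical features.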
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Event sequences are a practically relevant problem setting that has indeed gotten relatively less attention from the ML community. Having benchmarking sets is important to progress the field, hence I applaud this effort.
The authors state on page 3: “One of the primary challenges in benchmarking is ensuring that the datasets used are high quality and **accurately represent the problem domain**.” I very much agree with the observation that the value of benchmarking sets lies in how representative they are of the problem domain as a whole. Here I wonder whether this is the case. Looking at Table 1, I see one dataset with a regression target (Pendulum, which is synthetic), and 5 datasets with a classification target
This paper provides experimental results from assessing several sequence models on different types of datasets, which could benefit future research in this field. The paper adopts hyperparameter optimization and Monte Carlo cross-validation for a fair comparison among models.
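For readers unfamiliar with the term, Monte Carlo cross-validation scores a model on several independent random train/test splits and reports the mean and spread of the scores. The sketch below illustrates the general idea only. The function names, the split fraction, and the majority-class baseline are assumptions made for illustration, not the protocol used in EBES.

```python
import random
import statistics
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def monte_carlo_cv(
    n_samples: int,
    fit_and_score: Callable[[List[int], List[int]], float],
    n_splits: int = 5,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[float, float]:
    """Score a model on repeated random splits; return (mean, std) of the scores."""
    if n_splits < 2:
        raise ValueError("need at least 2 splits to estimate the spread")
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * n_samples))
    scores = []
    for _ in range(n_splits):
        # Draw a fresh random permutation for every split.
        perm = rng.sample(range(n_samples), n_samples)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        scores.append(fit_and_score(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_baseline(labels: Sequence[int]) -> Callable[[List[int], List[int]], float]:
    """Toy stand-in for a model: predict the most frequent training label."""

    def fit_and_score(train_idx: List[int], test_idx: List[int]) -> float:
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        return sum(labels[i] == majority for i in test_idx) / len(test_idx)

    return fit_and_score


# Made-up imbalanced labels: 70 sequences of class 0, 30 of class 1.
labels = [0] * 70 + [1] * 30
mean_acc, std_acc = monte_carlo_cv(len(labels), majority_baseline(labels), n_splits=10)
```

Because the test sets of different splits may overlap, the scores are not independent in the way k-fold scores are, but the number of splits can be chosen freely, and the standard deviation indicates how sensitive a model is to the particular split.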
Even though the paper compares time series data and event sequences in Figure 1, it fails to explain how the two differ in terms of the research challenges posed by the prediction tasks, and so does not justify the motivation and novelty of the proposed benchmark. In fact, several of the models selected for assessment were designed for time series data, not specifically for event sequences. The selection of the publicly available datasets does not show much diversity in terms of data scales, formats, difficulty le
EBES provides a unified framework that includes datasets, models, and experimental protocols to facilitate reproducible research and consistent evaluations. EBES includes datasets from different application domains, such as medical records, transaction sequences, and synthetic datasets, enhancing its diversity, applicability, and utility. The framework also offers a wide range of models, such as GRU and Transformer, as well as specialized ones such as mTAND and CoLES. This ensures a thorough evaluat
The proposed benchmark framework lacks a comparison with existing tools such as the multi-level task framework for event sequences [1] or [2], which makes it difficult to assess its advantages over existing frameworks and tools. The dataset diversity is also limited. The authors should extend their framework to other application domains, such as smart cities, weather and renewable energy, traffic, and social networks, to further improve its applicability. Further
Taxonomy
Topics: Scientific Computing and Data Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
Methods: Lib
