EBES: Easy Benchmarking for Event Sequences

Dmitry Osin; Igor Udovichenko; Viktor Moskvoretskii; Egor Shvetsov; Evgeny Burnaev

arXiv:2410.03399 · cs.LG · February 27, 2025

TL;DR

EBES provides a standardized benchmarking framework and open-source tools for event sequence classification, addressing the lack of evaluation protocols and enabling reliable comparison of models across diverse datasets.

Contribution

This paper introduces EBES, a comprehensive benchmark with standardized evaluation protocols, a PyTorch library implementing 9 models, and the largest collection of event sequence (EvS) datasets, with the aim of making research results more consistent and comparable.

Findings

01 · GRU-based models perform best among the tested models

02 · EvS classification presents unique challenges compared to other sequential data

03 · Benchmarking results facilitate model comparison and reveal robustness issues

Abstract

Event Sequences (EvS) refer to sequential data characterized by irregular sampling intervals and a mix of categorical and numerical features. Accurate classification of these sequences is crucial for various real-life applications, including healthcare, finance, and user interaction. Despite the popularity of the EvS classification task, there is currently no standardized benchmark or rigorous evaluation protocol. This lack of standardization makes it difficult to compare results across studies, which can result in unreliable conclusions and hinder progress in the field. To address this gap, we present EBES, a comprehensive benchmark for EvS classification with sequence-level targets. EBES features standardized evaluation scenarios and protocols, along with an open-source PyTorch library that implements 9 modern models. Additionally, it includes the largest collection of EvS datasets,…
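To make the definition above concrete, here is a minimal sketch of how one event sequence with a sequence-level target might be represented. It is only an illustration under stated assumptions: the class names `Event` and `EventSequence` and the feature names `mcc` and `amount` are hypothetical and are not taken from the EBES library or its datasets.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    """One event: a timestamp plus mixed categorical and numerical features."""
    time: float                   # timestamps are irregularly spaced
    categorical: Dict[str, str]   # hypothetical feature names
    numerical: Dict[str, float]   # hypothetical feature names


@dataclass
class EventSequence:
    """A sequence of events with a single sequence-level target."""
    events: List[Event]
    target: int

    def time_deltas(self) -> List[float]:
        """Gaps between consecutive events; unequal gaps mean irregular sampling."""
        times = [e.time for e in self.events]
        return [later - earlier for earlier, later in zip(times, times[1:])]


# Illustrative data only (a made-up transaction history).
seq = EventSequence(
    events=[
        Event(0.0, {"mcc": "grocery"}, {"amount": 12.5}),
        Event(1.5, {"mcc": "fuel"}, {"amount": 40.0}),
        Event(7.0, {"mcc": "grocery"}, {"amount": 8.25}),
    ],
    target=1,
)
deltas = seq.time_deltas()
```

Unlike a regularly sampled time series, the gaps in `deltas` differ from one another, and each event carries both categorical and numerical fields, which is the combination the abstract describes.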

Peer Reviews

Decision: ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

Event sequences are a practically relevant problem setting that has indeed received relatively little attention from the ML community. Benchmark datasets are important for advancing the field, so I applaud this effort.

Weaknesses

The authors state on page 3: “One of the primary challenges in benchmarking is ensuring that the datasets used are high quality and **accurately represent the problem domain**.” I very much agree with the observation that the value of benchmark datasets lies in how representative they are of the problem domain as a whole. Here I wonder if this is the case. Looking at Table 1, I see one dataset with a regression target (Pendulum, which is synthetic), and five datasets with a classification target

Reviewer 02 · Rating 3 · Confidence 4

Strengths

This paper reports experimental results for several sequence models evaluated on different types of datasets, which could benefit future research in this field. It uses hyperparameter optimization and Monte Carlo cross-validation to compare the models fairly.
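For readers unfamiliar with the term, Monte Carlo cross-validation is generally understood as repeated random train/test splitting, with the metric averaged over the repeats. The sketch below illustrates that general idea only; it is not the paper's protocol. The split ratio, the number of repeats, the toy labels, and the majority-class baseline are all assumptions chosen for illustration.

```python
import random
import statistics
from typing import Iterator, List, Sequence, Tuple


def monte_carlo_splits(
    n_samples: int, n_repeats: int, test_fraction: float = 0.2, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def majority_class_accuracy(
    labels: Sequence[int], train_idx: List[int], test_idx: List[int]
) -> float:
    """Toy stand-in for a model: predict the most frequent training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


# Illustrative labels only: 70 negatives and 30 positives.
labels = [0] * 70 + [1] * 30
scores = [
    majority_class_accuracy(labels, train_idx, test_idx)
    for train_idx, test_idx in monte_carlo_splits(len(labels), n_repeats=20)
]
mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
```

Reporting both `mean_score` and `std_score` across the random splits is what lets a comparison distinguish a real difference between models from split-to-split noise.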

Weaknesses

Even though the paper compares time series data and event sequences in Figure 1, it fails to explain how the research challenges of the prediction tasks differ, and therefore does not justify the motivation and novelty of the proposed benchmark. In fact, several of the models selected for assessment were designed for time series data, not specifically for event sequences. The selection of the publicly available datasets does not show much diversity in terms of data scales, formats, difficulty le

Reviewer 03 · Rating 5 · Confidence 4

Strengths

EBES provides a unified framework that includes datasets, models, and experimental protocols to facilitate reproducible research and consistent evaluations. EBES includes datasets from different application domains, such as medical records, transaction sequences, and synthetic datasets, enhancing its diversity, applicability and utility. The framework also offers a wide range of models, such as GRU and Transformer, as well as specialized ones such as mTAND and CoLES. This ensures a thorough evaluat

Weaknesses

The proposed benchmark framework lacks a comparison with existing tools such as the multi-level task framework for event sequences [1] or [2], which makes it difficult to assess its advantages over existing frameworks and tools. The dataset diversity is also limited. The authors should extend their framework to other application domains, such as smart cities, weather and renewable energy, traffic, and social networks, to further improve its applicability. Further

Code & Models

Repositories

on-point-rnd/ebes · PyTorch · Official

Taxonomy

Topics: Scientific Computing and Data Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems

Methods: Lib