Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring

Hans Weytjens; Jochen De Weerdt

arXiv:2107.01905·cs.AI·July 6, 2021

TL;DR

This paper addresses the challenge of creating unbiased, leakage-free benchmark datasets for predictive process monitoring by proposing standardized preprocessing steps to improve reproducibility and fair comparison.

Contribution

The paper introduces a systematic approach to preventing data leakage and bias in benchmark datasets, which improves reproducibility and fairness in predictive process monitoring research.

Findings

01 Demonstrates the impact of data leakage on research results

02 Proposes preprocessing steps for unbiased dataset creation

03 Improves reproducibility and fairness in benchmarking

Abstract

Advances in AI, and especially machine learning, are increasingly drawing research interest and efforts towards predictive process monitoring, the subfield of process mining (PM) that concerns predicting next events, process outcomes and remaining execution times. Unfortunately, researchers use a variety of datasets and ways to split them into training and test sets. The documentation of these preprocessing steps is not always complete. Consequently, research results are hard or even impossible to reproduce and to compare between papers. At times, the use of non-public domain knowledge further hampers the fair competition of ideas. Often the training and test sets are not completely separated, a data leakage problem particular to predictive process monitoring. Moreover, test sets usually suffer from bias in terms of both the mix of case durations and the number of running cases. These…
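The leakage problem named in the abstract arises because process cases overlap in time: a case that starts before the split point and ends after it contributes events to both sides. The abstract is cut off before it describes the authors' procedure, so the following is only a minimal sketch of one common remedy, a strict temporal split that discards cases straddling the split point. The function name `strict_temporal_split`, the `(start, end)` case representation, and the example dates are all hypothetical and are not taken from the paper or its repository.

```python
from datetime import datetime


def strict_temporal_split(cases, separation_time):
    """Split cases so that no case contributes to both sets.

    `cases` maps a case id to a (start, end) pair of datetimes.
    This is an illustrative assumption, not the paper's exact method.
    """
    train, test, dropped = [], [], []
    for case_id, (start, end) in cases.items():
        if end < separation_time:
            # Case finished before the split point: safe for training.
            train.append(case_id)
        elif start >= separation_time:
            # Case began at or after the split point: safe for testing.
            test.append(case_id)
        else:
            # Case straddles the split point: keeping it on either side
            # would leak information across the train/test boundary.
            dropped.append(case_id)
    return train, test, dropped


# Hypothetical example data.
cases = {
    "A": (datetime(2021, 1, 1), datetime(2021, 1, 10)),
    "B": (datetime(2021, 1, 5), datetime(2021, 2, 20)),
    "C": (datetime(2021, 2, 3), datetime(2021, 2, 15)),
}
train, test, dropped = strict_temporal_split(cases, datetime(2021, 2, 1))
print(train, test, dropped)
```

Dropping the straddling cases costs some data but guarantees full separation. It does not by itself address the test-set bias in case durations and running-case counts that the abstract also mentions.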

Code & Models

Repositories

raseidi/cosmo (PyTorch)
