TL;DR
This paper addresses the challenge of creating unbiased, leakage-free benchmark datasets for predictive process monitoring by proposing standardized preprocessing steps to improve reproducibility and fair comparison.
Contribution
It introduces a systematic approach to prevent data leakage and bias in benchmark datasets, enhancing reproducibility and fairness in predictive process monitoring research.
Findings
Demonstrates the impact of data leakage on research results
Proposes preprocessing steps for unbiased dataset creation
Improves reproducibility and fairness in benchmarking
Abstract
Advances in AI, and especially machine learning, are increasingly drawing research interest and efforts towards predictive process monitoring, the subfield of process mining (PM) that concerns predicting next events, process outcomes and remaining execution times. Unfortunately, researchers use a variety of datasets and ways to split them into training and test sets. The documentation of these preprocessing steps is not always complete. Consequently, research results are hard or even impossible to reproduce and to compare between papers. At times, the use of non-public domain knowledge further hampers the fair competition of ideas. Often the training and test sets are not completely separated, a data leakage problem particular to predictive process monitoring. Moreover, test sets usually suffer from bias in terms of both the mix of case durations and the number of running cases. These…
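The leakage problem described in the abstract can be illustrated with a small sketch. It is not the paper's preprocessing procedure, only a minimal strict temporal split under stated assumptions: the event log is a list of hypothetical `(case_id, activity, timestamp)` tuples, and the function name `strict_temporal_split` and the `split_time` parameter are invented for this example. Cases that finish before the split time go to training, cases that start at or after it go to testing, and cases that straddle it are set aside so that no case contributes events to both sets.

```python
from collections import defaultdict
from datetime import datetime


def strict_temporal_split(events, split_time):
    """Split an event log so that no case overlaps the split time.

    events: iterable of (case_id, activity, timestamp) tuples (assumed layout).
    Returns (train_events, test_events, dropped_case_ids).
    """
    # Group events by case so each case is assigned as a whole.
    by_case = defaultdict(list)
    for case_id, activity, ts in events:
        by_case[case_id].append((case_id, activity, ts))

    train, test, dropped = [], [], []
    for case_id, case_events in by_case.items():
        case_events.sort(key=lambda e: e[2])
        start, end = case_events[0][2], case_events[-1][2]
        if end < split_time:
            # Case finished before the split: safe for training.
            train.extend(case_events)
        elif start >= split_time:
            # Case started at or after the split: safe for testing.
            test.extend(case_events)
        else:
            # Case was running at the split time: it would leak across sets.
            dropped.append(case_id)
    return train, test, dropped


# Toy log (fabricated for illustration only).
log = [
    ("A", "register", datetime(2020, 1, 1)),
    ("A", "approve", datetime(2020, 1, 5)),
    ("B", "register", datetime(2020, 1, 8)),
    ("B", "approve", datetime(2020, 1, 20)),
    ("C", "register", datetime(2020, 1, 15)),
    ("C", "approve", datetime(2020, 1, 25)),
]
train, test, dropped = strict_temporal_split(log, datetime(2020, 1, 10))
```

In this toy log, case A lands in training, case C in testing, and case B is set aside because it was still running at the split time. Discarding straddling cases is the simplest way to guarantee separation, but it costs data and removes long-running cases near the boundary, so it does not by itself address the test-set bias in case durations and running-case counts that the abstract mentions.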
