Outlier detection from ETL Execution trace
Saptarsi Goswami, Samiran Ghosh, Amlan Chakrabarti

TL;DR
This paper presents a method for detecting outliers in ETL process logs using clustering techniques, aiming to identify processes that deviate significantly from normal behavior to improve efficiency.
Contribution
It introduces a proactive outlier detection approach for ETL logs, combining survey-based feature selection and clustering, validated on real production data.
Findings
Reduced analysis scope from 500 to 44 logs (8.8%)
Identified 2 genuine outlier clusters indicating potential issues
Demonstrated effectiveness of clustering-based outlier detection in ETL logs
Abstract
Extract, Transform, Load (ETL) is an integral part of Data Warehousing (DW) implementation. The commercial tools used for this purpose capture a large amount of execution trace in the form of various log files containing a plethora of information. However, there has been hardly any initiative in which proactive analyses have been performed on ETL logs to improve their efficiency. In this paper we use an outlier detection technique to find the processes that vary most from the group in terms of execution trace. As our experiment was carried out on actual production processes, we treat any outlier as a signal rather than noise. To identify the input parameters for the outlier detection algorithm, we conduct a survey among the developer community with a varied mix of experience and expertise. We use simple text parsing to extract from the logs the features shortlisted in the survey. Subsequently…
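The pipeline the abstract outlines (parse features out of logs, cluster them, treat small clusters as outliers) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the log line layout, the three features (`rows`, `elapsed_s`, `errors`), the `radius` and `max_fraction` values, and the use of a simple leader-style clustering, which stands in for whatever clustering algorithm the authors actually applied.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical log layout; real ETL tools each write their own format.
LOG_PATTERN = re.compile(r"rows=(\d+)\s+elapsed_s=(\d+(?:\.\d+)?)\s+errors=(\d+)")


def extract_features(log_text):
    """Return (rows, elapsed seconds, errors) parsed from one log, or None."""
    m = LOG_PATTERN.search(log_text)
    if m is None:
        return None
    return (float(m.group(1)), float(m.group(2)), float(m.group(3)))


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # guard zero spread
    return [tuple((v - mu) / sd for v, (mu, sd) in zip(r, stats)) for r in rows]


def leader_cluster(points, radius):
    """Assign each point to the first cluster whose leader lies within radius."""
    leaders, labels = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= radius:
                labels.append(i)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels


def outlier_indices(labels, max_fraction=0.1):
    """Indices of points in clusters holding at most max_fraction of all points."""
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    limit = max_fraction * len(labels)
    return [i for i, label in enumerate(labels) if sizes[label] <= limit]


# Made-up sample: 19 similar runs plus one slow, error-heavy run.
sample_logs = [
    f"job_{i} rows={1000 + i % 3} elapsed_s={60 + i % 2} errors=0"
    for i in range(19)
] + ["job_19 rows=1001 elapsed_s=900 errors=25"]

features = [extract_features(text) for text in sample_logs]
labels = leader_cluster(standardize(features), radius=3.0)
flagged = outlier_indices(labels)  # positions of logs worth a manual look
```

In this sketch the flagged logs are candidates for manual review rather than confirmed problems, which mirrors the idea of shrinking the set of logs an analyst has to inspect.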
