Starting Off on the Wrong Foot: Pitfalls in Data Preparation
Jiayi Guo, Panyi Dong, Zhiyu Quan

TL;DR
This paper introduces a data preparation framework for insurance data that aims to improve model robustness and reliability. It uses support points for representative data splitting and the Chatterjee correlation coefficient for feature screening, addressing the imbalance and instability of insurance loss data.
Contribution
The study proposes a novel data preparation approach combining support points and Chatterjee correlation, integrated into an InsurAutoML pipeline, to enhance insurance modeling reliability.
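To make the data-splitting idea concrete: support points are commonly described as a point set that minimizes the energy distance to the target distribution, so the energy distance between a candidate training split and the full data is one way to gauge how representative that split is. The sketch below only evaluates that distance for a given split; it does not compute support points and is not the authors' implementation. The function name `energy_distance` and the toy data are assumptions made for illustration.

```python
import numpy as np

def energy_distance(a, b):
    """Sample (V-statistic) energy distance between two samples.

    Hypothetical helper for illustration: 2*E|A-B| - E|A-A'| - E|B-B'|,
    using Euclidean norms. Memory grows with len(a) * len(b).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat 1-D input as n observations of a single feature.
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]

    def mean_dist(u, v):
        # Mean pairwise Euclidean distance between rows of u and rows of v.
        return np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1).mean()

    return 2.0 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b)

# Toy usage (assumed data): a mostly-zero, heavy-tailed "loss" column.
rng = np.random.default_rng(0)
loss = np.where(rng.random(1000) < 0.05, rng.lognormal(8.0, 1.0, 1000), 0.0)
train_idx = rng.choice(loss.size, size=200, replace=False)
print(energy_distance(loss, loss[train_idx]))  # smaller means more representative
```

Repeating the last three lines with different seeds gives a rough picture of how much random splits of imbalanced data vary in representativeness.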
Findings
Significantly improves model robustness and interpretability.
Reduces computational resource requirements.
Addresses challenges of imbalanced insurance data.
Abstract
When working with real-world insurance data, practitioners often encounter challenges during the data preparation stage that can undermine the statistical validity and reliability of downstream modeling. This study illustrates that conventional data preparation procedures, such as random train-test partitioning, often yield unreliable and unstable results when confronted with highly imbalanced insurance loss data. To mitigate these limitations, we propose a novel data preparation framework leveraging two recent statistical advancements: support points for representative data splitting to ensure distributional consistency across partitions, and the Chatterjee correlation coefficient for initial, non-parametric feature screening to capture feature relevance and dependence structure. We further integrate these theoretical advances into a unified, efficient framework that also incorporates…
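As an illustration of the feature-screening step, the following is a minimal sketch of Chatterjee's rank correlation coefficient in its general form that allows ties in the response, which matters for insurance losses with many zeros. It is not taken from the paper or from InsurAutoML; the function name `chatterjee_xi`, the `seed` parameter, and the random tie-breaking in the feature are choices made for this example.

```python
import numpy as np

def chatterjee_xi(x, y, seed=0):
    """Chatterjee's xi of y on x (asymmetric), tie-aware version.

    Hypothetical helper for illustration. Returns nan if y is constant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Sort observations by x; ties in x are broken uniformly at random
    # by shuffling first and then applying a stable sort.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    order = perm[np.argsort(x[perm], kind="stable")]
    y_by_x = y[order]

    y_sorted = np.sort(y)
    # r[i] = number of j with y_j <= y_i ; l[i] = number of j with y_j >= y_i
    r = np.searchsorted(y_sorted, y_by_x, side="right")
    l = n - np.searchsorted(y_sorted, y_by_x, side="left")

    denom = 2.0 * np.sum(l * (n - l))
    if denom == 0:
        return float("nan")
    return 1.0 - n * np.abs(np.diff(r)).sum() / denom

# Toy usage (assumed data): xi picks up a non-monotone dependence.
x = np.linspace(-1.0, 1.0, 201)
print(chatterjee_xi(x, x ** 2))
```

Because the coefficient is asymmetric, a screening step would compute it with each candidate feature as `x` and the loss as `y`, then rank features by the result.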
Taxonomy
Topics: Statistical Methods and Inference · Imbalanced Data Classification Techniques · Generative Adversarial Networks and Image Synthesis
