Predicting Mortgage Default with Machine Learning: AutoML, Class Imbalance, and Leakage Control
Xianghong Hu, Tianning Xu, Ying Chen, Shuai Wang

TL;DR
This paper evaluates machine learning methods for mortgage default prediction, emphasizing leakage control and class imbalance handling, and finds AutoML approaches like AutoGluon perform best on real-world data.
Contribution
The paper combines leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods, and controlled downsampling of the majority class, and it compares multiple models, including AutoML, for mortgage default prediction.
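A strict temporal split of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the `LoanRecord` fields, the `YYYYMM` integer encoding, the `cutoff` parameter, and the choice to discard loans that straddle the cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LoanRecord:
    """Hypothetical loan-level row; field names are illustrative only."""
    loan_id: str
    orig_period: int    # origination month encoded as YYYYMM (assumed)
    report_period: int  # reporting month encoded as YYYYMM (assumed)
    defaulted: int      # 1 = default, 0 = no default


def temporal_split(
    records: List[LoanRecord], cutoff: int
) -> Tuple[List[LoanRecord], List[LoanRecord]]:
    """Split on BOTH origination and reporting period.

    Train: originated and reported on or before the cutoff.
    Test:  originated and reported strictly after the cutoff.
    Rows that straddle the cutoff are dropped (an assumption of this
    sketch), so no training row carries information reported during
    the test window.
    """
    train = [
        r for r in records
        if r.orig_period <= cutoff and r.report_period <= cutoff
    ]
    test = [
        r for r in records
        if r.orig_period > cutoff and r.report_period > cutoff
    ]
    return train, test
```

For example, with a cutoff of `201812`, a loan originated in `201706` but reported in `201903` would appear in neither set under this rule.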
Findings
AutoGluon achieves the highest AUROC among evaluated models.
Performance remains stable across different class imbalance ratios.
Leakage control and temporal splitting improve evaluation reliability.
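For readers unfamiliar with the metric used in these findings, AUROC is the probability that a randomly chosen positive (defaulted) loan receives a higher score than a randomly chosen negative one, with ties counted as one half. The function below is a generic pairwise sketch of that definition, not the paper's evaluation code; the name `auroc` and its arguments are chosen here for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC: fraction of positive/negative
    pairs in which the positive is scored higher, ties counting 0.5.

    Quadratic in the number of rows, so suitable only for small
    illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In practice a library routine based on sorting would normally be preferred for large datasets.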
Abstract
Mortgage default prediction is a core task in financial risk management, and machine learning models are increasingly used to estimate default probabilities and provide interpretable signals for downstream decisions. In real-world mortgage datasets, however, three factors frequently undermine evaluation validity and deployment reliability: ambiguity in default labeling, severe class imbalance, and information leakage arising from temporal structure and post-event variables. We compare multiple machine learning approaches for mortgage default prediction using a real-world loan-level dataset, with emphasis on leakage control and imbalance handling. We employ leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods, and controlled downsampling of the majority class. Across multiple positive-to-negative ratios, performance remains…
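Controlled downsampling of the majority class to a chosen positive-to-negative ratio might look like the sketch below. The function name `downsample_majority`, the `neg_per_pos` parameter, and the fixed seed are assumptions for illustration; the source does not specify how the sampling was implemented or which data partitions it was applied to.

```python
import random
from typing import List, Sequence


def downsample_majority(
    labels: Sequence[int], neg_per_pos: float, seed: int = 0
) -> List[int]:
    """Return sorted row indices that keep every positive (label 1)
    and a random subset of negatives (label 0) so that there are at
    most `neg_per_pos` negatives per positive.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than exist.
    n_keep = min(len(neg), int(neg_per_pos * len(pos)))
    rng = random.Random(seed)  # seeded for reproducibility
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)
```

Calling this with several values of `neg_per_pos` (for instance 1, 5, and 10) yields training subsets at different ratios, which is one way to run the kind of ratio comparison the abstract describes.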
Taxonomy
Topics: Financial Distress and Bankruptcy Prediction · Imbalanced Data Classification Techniques · Credit Risk and Financial Regulations
