Missing Values Handling for Machine Learning Portfolios
Andrew Y. Chen, Jack McCoy

TL;DR
This paper examines the structure of missing data in financial return predictors and finds that simple cross-sectional mean imputation performs well relative to more complex methods, because missingness comes in large blocks and cross-sectional correlations are small.
Contribution
It characterizes the origins of missingness in financial predictors and evaluates the effectiveness of different missing value handling techniques in machine learning portfolios.
Findings
Simple cross-sectional mean imputation performs well compared to expectation-maximization (EM) methods.
Missingness occurs in large blocks organized by time and source.
Sophisticated imputations can introduce noise and reduce performance.
Abstract
We characterize the structure and origins of missingness for 159 cross-sectional return predictors and study missing value handling for portfolios constructed using machine learning. Simply imputing with cross-sectional means performs well compared to rigorous expectation-maximization methods. This stems from three facts about predictor data: (1) missingness occurs in large blocks organized by time, (2) cross-sectional correlations are small, and (3) missingness tends to occur in blocks organized by the underlying data source. As a result, observed data provide little information about missing data. Sophisticated imputations introduce estimation noise that can lead to underperformance if machine learning is not carefully applied.
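To make the cross-sectional mean imputation mentioned in the abstract concrete, here is a minimal sketch. It is not the authors' code. The data layout (a list of dicts), the predictor names `bm` and `mom`, and the function name are hypothetical. The fallback of 0.0 when a predictor is missing for every stock on a date is also an assumption, one that is only sensible if predictors have been standardized beforehand.

```python
from statistics import fmean


def impute_cross_sectional_mean(rows, predictors):
    """Fill missing predictor values with the same-date mean across stocks.

    rows: list of dicts with keys "date", "stock", and one key per predictor;
          a missing value is represented by None.
    Returns a new list of dicts; the input is left unchanged.
    """
    # Collect the observed values for each (date, predictor) cross-section.
    observed = {}
    for row in rows:
        for p in predictors:
            if row[p] is not None:
                observed.setdefault((row["date"], p), []).append(row[p])

    filled = []
    for row in rows:
        new_row = dict(row)
        for p in predictors:
            if new_row[p] is None:
                values = observed.get((row["date"], p))
                # Assumption: if the whole cross-section is missing, use 0.0.
                new_row[p] = fmean(values) if values else 0.0
        filled.append(new_row)
    return filled


# Hypothetical toy panel: two predictors, three stocks, two months.
rows = [
    {"date": "2020-01", "stock": "A", "bm": 0.5, "mom": None},
    {"date": "2020-01", "stock": "B", "bm": None, "mom": 0.25},
    {"date": "2020-01", "stock": "C", "bm": 1.5, "mom": 0.75},
    {"date": "2020-02", "stock": "A", "bm": None, "mom": None},
    {"date": "2020-02", "stock": "B", "bm": 2.0, "mom": None},
]
filled = impute_cross_sectional_mean(rows, ["bm", "mom"])
```

Each imputed value uses only information from the same date, so the fill never looks across time or across predictors. That matches the abstract's point that observed data carry little information about the missing entries, which is why a richer model may add estimation noise rather than signal.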
Taxonomy
Topics: Financial Markets and Investment Strategies · Forecasting Techniques and Applications · Financial Risk and Volatility Modeling
