Improving the output quality of official statistics based on machine learning algorithms

Quinten Meertens; Cees Diks; Jaap van den Herik; Frank Takes

arXiv:2103.00834·stat.ME·March 2, 2021

Open Access

TL;DR

This paper compares bias correction methods for machine learning models in official statistics, focusing on how they perform under concept drift, especially prior probability shift, to improve output quality.

Contribution

It provides a theoretical and experimental comparison of misclassification and calibration estimators for bias correction under prior probability shift in official statistics.

Findings

01 Expressions for the bias and variance of both correction methods.

02 A decision boundary for the relative performance of the two methods, as a function of model accuracy, class distribution, and test set size.

03 Practical recommendations for applying machine learning in official statistics.

Abstract

National statistical institutes are currently investigating how to improve the output quality of official statistics based on machine learning algorithms. A key obstacle is concept drift, i.e., a change over time in the joint distribution of the independent variables and a dependent (categorical) variable. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. The literature offers a variety of bias correction methods as a promising solution. In this paper, we compare two popular correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two correction methods theoretically as well as experimentally. Our theoretical results are expressions for the bias…
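To make the two correction methods concrete, here is a minimal sketch for a binary classifier. This page does not give the formulas, so the definitions below are assumptions based on how the estimators are commonly stated. The misclassification estimator inverts the classifier's sensitivity and specificity. The calibration estimator reweights the predicted labels by P(true = 1 | predicted label) as measured on a test set. All function names and numbers are hypothetical, and the calculation uses population-level quantities rather than finite samples, so it illustrates bias only, not variance.

```python
# Sketch only: formulas and names are assumptions, not taken from the paper's text.

def predicted_positive_rate(base_rate, sensitivity, specificity):
    """Share of units labelled positive by the classifier for a given true base rate."""
    return base_rate * sensitivity + (1.0 - base_rate) * (1.0 - specificity)


def misclassification_estimate(pred_pos_rate, sensitivity, specificity):
    """Solve pred_pos_rate = a * sens + (1 - a) * (1 - spec) for the base rate a."""
    return (pred_pos_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)


def calibration_probabilities(base_rate, sensitivity, specificity):
    """Return P(true=1 | pred=1) and P(true=1 | pred=0) for a given base rate."""
    pos = predicted_positive_rate(base_rate, sensitivity, specificity)
    p_true_given_pred1 = base_rate * sensitivity / pos
    p_true_given_pred0 = base_rate * (1.0 - sensitivity) / (1.0 - pos)
    return p_true_given_pred1, p_true_given_pred0


def calibration_estimate(pred_pos_rate, p_true_given_pred1, p_true_given_pred0):
    """Weight each predicted class by its test-set probability of being truly positive."""
    return (pred_pos_rate * p_true_given_pred1
            + (1.0 - pred_pos_rate) * p_true_given_pred0)


# Hypothetical classifier quality, assumed stable over time (prior probability shift
# changes P(Y) but leaves P(X | Y), and hence sensitivity and specificity, unchanged).
SENSITIVITY, SPECIFICITY = 0.9, 0.8
TEST_BASE_RATE = 0.5     # class distribution in the labelled test set
TARGET_BASE_RATE = 0.2   # class distribution in the shifted target population

# Calibration probabilities are learned on the test set.
c1, c0 = calibration_probabilities(TEST_BASE_RATE, SENSITIVITY, SPECIFICITY)

# What the classifier outputs on the shifted target population.
target_pos = predicted_positive_rate(TARGET_BASE_RATE, SENSITIVITY, SPECIFICITY)

mis_est = misclassification_estimate(target_pos, SENSITIVITY, SPECIFICITY)
cal_est = calibration_estimate(target_pos, c1, c0)

# Sanity check without shift: the calibration estimator recovers the test base rate.
test_pos = predicted_positive_rate(TEST_BASE_RATE, SENSITIVITY, SPECIFICITY)
cal_no_shift = calibration_estimate(test_pos, c1, c0)

print(f"uncorrected: {target_pos:.3f}")
print(f"misclassification estimator: {mis_est:.3f}")
print(f"calibration estimator: {cal_est:.3f}")
```

In this illustrative setting the misclassification estimator recovers the target base rate of 0.2, while the calibration estimator lands near 0.35 because its probabilities were fitted under the old class distribution. With finite test sets the picture also depends on variance, which this sketch does not model.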

Taxonomy

Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications