Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics
Bogdan Oancea

TL;DR
This paper introduces an unsupervised machine learning framework to detect structurally atypical regional profiles in Europe using Eurostat data, improving validation of high-dimensional socio-economic statistics.
Contribution
It compares five anomaly detection methods and proposes a scalable, reproducible approach for identifying meaningful regional anomalies in European statistics.
Findings
The analysis identified regions with divergent socio-economic profiles, including major cities and disadvantaged areas.
Machine learning methods consistently flagged regions with significant profile deviations.
The framework is compatible with existing validation workflows and scalable for broader use.
Abstract
Ensuring the coherence of regional socio-economic statistics is a central task for national statistical institutes. Traditional validation tools, such as range edits, ratio checks, or univariate outlier detection, are effective for identifying extreme values in individual series but are less suited to detecting unusual combinations of indicators in high-dimensional settings. This paper proposes an unsupervised machine learning framework for identifying structurally atypical regional profiles within Europe using publicly available Eurostat data. We construct a cross-sectional dataset of NUTS2 regions (2022) covering four key indicators: GDP per capita in PPS, unemployment rate, tertiary educational attainment, and population density. We apply and compare five anomaly detection techniques: univariate z-scores, Mahalanobis distance, Isolation Forest, Local Outlier Factor, and One-Class…
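
The abstract's contrast between univariate checks and multivariate methods can be illustrated with a minimal sketch. It is not the paper's implementation: the nine regions and their values are hypothetical toy numbers (not Eurostat data), only two of the four indicators are used so that the covariance inverse can be written in closed form, and only two of the named techniques (univariate z-scores and Mahalanobis distance) are shown. The names `regions`, `zscores`, and `mahalanobis` are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical toy values, NOT Eurostat data:
# (GDP per capita index, tertiary educational attainment in %).
regions = {
    "A": (80, 22), "B": (90, 26), "C": (100, 29), "D": (110, 34),
    "E": (120, 37), "F": (130, 42), "G": (140, 45), "H": (150, 50),
    "X": (130, 26),  # each value is ordinary on its own; the pairing is unusual
}

names = list(regions)
gdp = [regions[name][0] for name in names]
edu = [regions[name][1] for name in names]


def zscores(values):
    """Standardize a series with its sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


z_gdp, z_edu = zscores(gdp), zscores(edu)

# Sample Pearson correlation between the two indicators (n - 1 denominator).
n = len(names)
r = sum(a * b for a, b in zip(z_gdp, z_edu)) / (n - 1)


def mahalanobis(za, zb, r):
    """Mahalanobis distance for two standardized indicators with correlation r.

    In standardized coordinates the covariance matrix is [[1, r], [r, 1]],
    whose inverse is [[1, -r], [-r, 1]] / (1 - r**2).
    """
    return sqrt((za**2 - 2 * r * za * zb + zb**2) / (1 - r**2))


# Univariate view: the largest absolute z-score across both indicators.
max_abs_z = {nm: max(abs(a), abs(b)) for nm, a, b in zip(names, z_gdp, z_edu)}
# Multivariate view: distance that accounts for the correlation structure.
maha = {nm: mahalanobis(a, b, r) for nm, a, b in zip(names, z_gdp, z_edu)}

for nm in names:
    print(f"{nm}: max|z| = {max_abs_z[nm]:.2f}, Mahalanobis = {maha[nm]:.2f}")
```

In this toy sample, region X has unremarkable z-scores on both indicators, yet it has the largest Mahalanobis distance because its combination of high GDP and low attainment runs against the positive correlation in the other regions. The sketch ranks regions rather than applying a cutoff, since any threshold would be unreliable with so few observations.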
