From Administrative Chaos to Analytical Cohorts: A Three-Stage Normalisation Pipeline for Longitudinal University Administrative Records
H. R. Paz

TL;DR
This paper introduces a three-stage normalisation pipeline for longitudinal university records, enabling consistent data analysis over four decades and addressing structural missingness without imputation.
Contribution
It presents a transparent, reproducible normalisation framework specifically designed for higher education data, handling structural missingness and defining coherent cohorts for analytics.
Findings
Achieved 100% student preservation and full geocoding
56.6% of records had valid school types assigned
Missingness is highly predictable from entry decade and geography
Abstract
The growing use of longitudinal university administrative records in data-driven decision-making often overlooks a critical layer: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation pipeline for a dataset of 24,133 engineering students at a Latin American public university, spanning four decades (1980-2019). The pipeline comprises: (i) N1 CENSAL, harmonising demographics into a single person-level layer; (ii) N1b IDENTITY RESOLUTION, consolidating duplicate identifiers into a canonical ID while preserving an audit trail; and (iii) N1c GEO and SECONDARY-SCHOOL NORMALISATION, which builds reference tables, classifies school types (state national, state provincial, private secular, private religious), and flags irrecoverable cases as DATA_MISSING. The pipeline preserves 100% of students, achieves full geocoding, and yields valid…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Causal Inference Techniques · Statistics Education and Methodologies · Census and Population Estimation
