A Systematic Approach to Cleaning Routine Health Surveillance Datasets: An Illustration Using National Vector Borne Disease Control Programme Data of Punjab, India
Gurpreet Singh, Biju Soman, Arun Mitra

TL;DR
This paper presents a systematic, semi-automated data cleaning approach for health surveillance datasets, demonstrated on dengue data from Punjab, India, improving data quality for epidemiological analysis and decision making.
Contribution
It introduces a logic model and computational workflows for reproducible, scalable data cleaning in health information systems, with successful application to real-world disease surveillance data.
Findings
Cleaning and imputation of an estimated date succeeded for more than 96% of records in every study year (2015 to 2019)
Age and sex information cleaned and extracted for more than 98% of records
Development of analysis-ready datasets for epidemiological insights
Abstract
Advances in ICT4D and data science facilitate systematic, reproducible, and scalable data cleaning for strengthening routine health information systems. A logic model for data cleaning was used; it included an algorithm for screening, diagnosing, and editing datasets in a rule-based, interactive, and semi-automated manner. A priori computational workflows and operational definitions were prepared. Model performance was illustrated using the dengue line-list of the National Vector Borne Disease Control Programme, Punjab, India, from 01 January 2015 to 31 December 2019. Cleaning and imputation of an estimated date were successful for 96.1% and 98.9% of records for the years 2015 and 2016, respectively, and for all cases in the years 2017, 2018, and 2019. Information on age and sex was cleaned and extracted for more than 98.4% and 99.4% of records, respectively. The logic model application resulted in the…
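The screening, diagnosis, and editing steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' workflow: the field names (`date_of_onset`, `date_of_testing`, `date_of_reporting`, `age_sex`), the accepted date formats, the field priority order, and the age/sex pattern are all assumptions made for illustration. Only the study period (01 January 2015 to 31 December 2019) comes from the abstract.

```python
import re
from datetime import date, datetime

# Assumed formats; the paper's operational definitions are not reproduced here.
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")
STUDY_START, STUDY_END = date(2015, 1, 1), date(2019, 12, 31)

# Hypothetical pattern for free-text entries such as "35/M" or "7 y, female".
AGE_SEX = re.compile(
    r"(\d{1,3})\s*(?:years|yrs|yr|y)?\s*[/,\- ]*\s*(male|female|m|f)\b",
    re.IGNORECASE,
)


def parse_date(text):
    """Screening rule: accept a value only if it parses and falls in the study period."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt).date()
        except (ValueError, AttributeError):  # bad format or missing value
            continue
        if STUDY_START <= parsed <= STUDY_END:
            return parsed
    return None


def estimate_date(record):
    """Editing rule: impute an estimated date from the first usable field (assumed order)."""
    for field in ("date_of_onset", "date_of_testing", "date_of_reporting"):
        parsed = parse_date(record.get(field))
        if parsed is not None:
            return parsed, field
    return None, None


def extract_age_sex(text):
    """Extract age in years and sex ('M' or 'F') from a combined free-text field."""
    match = AGE_SEX.search(text or "")
    if not match:
        return None, None
    age = int(match.group(1))
    return (age if age <= 120 else None), match.group(2)[0].upper()


def clean(records):
    """Split records into cleaned rows and rows flagged for manual diagnosis."""
    cleaned, flagged = [], []
    for record in records:
        est, source = estimate_date(record)
        age, sex = extract_age_sex(record.get("age_sex"))
        row = {**record, "estimated_date": est, "date_source": source,
               "age": age, "sex": sex}
        (cleaned if est is not None else flagged).append(row)
    return cleaned, flagged


# Made-up example rows, not taken from the surveillance data.
line_list = [
    {"date_of_onset": "03/09/2016", "date_of_testing": "05-09-2016", "age_sex": "35/M"},
    {"date_of_onset": "n/a", "date_of_testing": "2017-10-12", "age_sex": "7 y, female"},
    {"date_of_onset": "", "date_of_testing": "12.13.2018", "age_sex": "unknown"},
]
cleaned, flagged = clean(line_list)
print(len(cleaned), "cleaned;", len(flagged), "flagged for review")
```

Keeping the `date_source` column records which field supplied each imputed date, so the automated edits remain auditable, and rows that fail every rule are routed to a separate list for interactive review rather than silently dropped.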
Taxonomy
Topics: Data-Driven Disease Surveillance · Healthcare Systems and Reforms · Vaccine Coverage and Hesitancy
