Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data
Irena Girshovitz, Atai Ambus, Moni Shahar, Ran Gilad-Bachrach

TL;DR
This paper introduces Medical Data Pecking, a novel context-aware method that uses large language models and software testing concepts to automatically evaluate and improve the quality of structured medical data in electronic health records.
Contribution
It presents a new approach combining software testing and medical knowledge, implemented in the MDPT tool, for systematic and automated EHR data quality assessment.
Findings
Successfully identified 20-43 data issues across datasets
Generated 55-73 tests per cohort for quality evaluation
Demonstrated the effectiveness of LLM-based test suites for evaluating medical data quality
Abstract
Background: The use of Electronic Health Records (EHRs) for epidemiological studies and artificial intelligence (AI) training is increasing rapidly. The reliability of the results depends on the accuracy and completeness of EHR data. However, EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors, as they are primarily collected for clinical and billing purposes. Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research.
Methods: We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns. We demonstrate our approach using the Medical Data Pecking Tool (MDPT), which consists of two main components: (1) an automated test generator that uses large language…
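To make the idea of adapting unit testing and coverage to data more concrete, here is a minimal Python sketch. It is not MDPT and does not reproduce the authors' test generator. The record layout, the `DataTest` class, the `run_suite` function, the two hand-written checks, and the definition of coverage as the share of fields touched by at least one test are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record layout: one dict per patient.
Record = dict[str, object]


@dataclass
class DataTest:
    """A unit-test-style check over a cohort (illustrative, not the MDPT API)."""
    name: str
    fields: tuple[str, ...]                 # fields the check reads
    check: Callable[[list[Record]], bool]   # returns True when the data pass


def run_suite(records: list[Record], tests: list[DataTest]) -> tuple[dict[str, bool], float]:
    """Run every test and report an assumed field-level coverage score."""
    results = {t.name: t.check(records) for t in tests}
    all_fields: set[str] = set()
    for r in records:
        all_fields.update(r.keys())
    tested = {f for t in tests for f in t.fields}
    # Assumed coverage definition: fraction of fields read by at least one test.
    coverage = len(tested & all_fields) / len(all_fields) if all_fields else 0.0
    return results, coverage


# Toy cohort; patient 3 has a deliberately implausible first visit year.
cohort = [
    {"patient_id": 1, "sex": "F", "birth_year": 1988, "first_visit_year": 2015},
    {"patient_id": 2, "sex": "M", "birth_year": 1961, "first_visit_year": 2009},
    {"patient_id": 3, "sex": "M", "birth_year": 1976, "first_visit_year": 1950},
    {"patient_id": 4, "sex": "F", "birth_year": 1994, "first_visit_year": 2020},
]

tests = [
    DataTest(
        name="first_visit_not_before_birth",
        fields=("birth_year", "first_visit_year"),
        check=lambda rs: all(r["first_visit_year"] >= r["birth_year"] for r in rs),
    ),
    DataTest(
        name="sex_uses_known_codes",
        fields=("sex",),
        check=lambda rs: all(r["sex"] in {"F", "M"} for r in rs),
    ),
]

results, coverage = run_suite(cohort, tests)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print(f"field coverage: {coverage:.2f}")
```

In this toy run the temporal check fails because of patient 3, while the coding check passes. Three of the four fields are read by some test, so the assumed coverage score is 0.75, which flags `patient_id` as untested.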
Taxonomy
Topics: Electronic Health Records Systems · Machine Learning in Healthcare · Data Quality and Management
