OMOP ETL Framework for Semi-Structured Health Data
Jacob Desmond, Ryan Wartmann, Chng Wei Lau, Steven Thomas, Paul M. Middleton, Jeewani Anupama Ginige

TL;DR
This paper presents a flexible, schema-agnostic framework for transforming diverse healthcare data into the OMOP Common Data Model, supporting relational and document-based sources with validation on large-scale real-world data.
Contribution
The framework extends existing methods by using YAML specifications for schema-agnostic transformation, and it adds production features such as provenance tracking and incremental updates.
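To make the idea of specification-driven mapping concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical mapping spec (shown as the dict a YAML parser would produce), a hypothetical `updated_at` field on each source record, and illustrative helper names (`get_path`, `transform`, `incremental`). Only the OMOP `person` table and its `person_source_value` and `year_of_birth` columns come from the OMOP-CDM itself.

```python
from datetime import datetime

# Hypothetical mapping spec, written as the dict a YAML parser would return.
# Equivalent YAML (illustrative only, not the paper's format):
#   target: person
#   fields:
#     person_source_value: patient.id
#     year_of_birth: patient.birth_year
SPEC = {
    "target": "person",
    "fields": {
        "person_source_value": "patient.id",
        "year_of_birth": "patient.birth_year",
    },
}


def get_path(record, dotted):
    """Resolve a dotted path in a nested dict; return None if a key is missing."""
    current = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def transform(record, spec, source_name):
    """Apply the spec to one source record and attach simple provenance."""
    row = {col: get_path(record, path) for col, path in spec["fields"].items()}
    # Provenance here is just the source name plus the paths each column came from.
    row["_provenance"] = {"source": source_name, "paths": dict(spec["fields"])}
    return row


def incremental(records, spec, source_name, watermark):
    """Transform only records updated after the watermark.

    Returns the new rows and the advanced watermark.
    """
    rows = []
    new_watermark = watermark
    for rec in records:
        updated = rec["updated_at"]  # assumed field, not from the paper
        if updated > watermark:
            rows.append(transform(rec, spec, source_name))
            new_watermark = max(new_watermark, updated)
    return rows, new_watermark


records = [
    {"patient": {"id": "A1", "birth_year": 1980}, "updated_at": datetime(2024, 1, 1)},
    {"patient": {"id": "B2"}, "updated_at": datetime(2024, 3, 1)},
]
rows, watermark = incremental(records, SPEC, "mongo.patients", datetime(2024, 2, 1))
```

Because the transformation logic reads only the spec, the same code handles a nested document (as above) or a flat relational row, where each path is a single column name. A missing source field maps to `None` rather than raising, which is one possible design choice among several.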
Findings
Validated on 2.7 million patient records and 27 million encounters
Achieved 97% data quality passing rate
Supports both relational and document-based data sources
Abstract
Healthcare data are generated in many different formats, which makes them difficult to integrate and reuse across institutions and studies. Standardisation is required to enable consistent large-scale analysis. The OMOP-CDM, developed by the OHDSI community, provides one widely adopted standard. Our framework achieves schema-agnostic transformation by building on existing literature, using human-readable YAML specifications to support both relational (Microsoft SQL Server, MSSQL) and document-based (MongoDB) data sources. It also incorporates critical production-readiness features: provenance-aware mapping and support for incremental updates. We validated the pipeline using 2.7 million patient records and 27 million encounters across six hospitals spanning two decades of records. The resulting OMOP-CDM dataset demonstrated an acceptable level of data quality with a 97% overall…
Taxonomy
Topics: Electronic Health Records Systems · Scientific Computing and Data Management · Machine Learning in Healthcare
