OpenHealth Lake: Designing and testing a data lakehouse platform for health applications
Danilo Silva, Monika Moir, Cheryl Baxter, Tulio de Oliveira, Joicymara Xavier, Marcel Dunaiski

TL;DR
OpenHealth Lake is a flexible, scalable data lakehouse platform designed for health applications, enabling secure, efficient data sharing and management in collaborative global health initiatives.
Contribution
The paper introduces a novel open-source data lakehouse prototype tailored for health data management, emphasizing usability, adaptability, and compliance with FAIR principles.
Findings
A user study found the prototype usable and useful.
The platform supports multiple interaction methods, including an API, Python, and R.
The design demonstrates scalability and reproducibility for diverse organizational needs.
Abstract
Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate extensive heterogeneous datasets. In the context of collaborative global health initiatives, secure storage and sharing of data are crucial to support impactful research. However, the absence of a unified data management platform complicates efficient data exchange and governance within these initiatives. In this paper, we introduce the design process of OpenHealth Lake, a data management prototype platform based on a data lakehouse architecture, data federation, and the FAIR principles. The platform is designed using open-source tools, guided by system requirements identified in previously published studies and complemented by insights from the existing literature. The current prototype platform comprises a user-friendly website, an open API, Python and R…
