On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Pradeeban Kathiravelu, Ashish Sharma, Helena Galhardas, Peter Van Roy, Luís Veiga

TL;DR
This paper introduces a hybrid ETL approach combining eager and lazy strategies for scientific data integration, enabling efficient, incremental, and user-driven data sharing across distributed sources, demonstrated with a medical research platform.
Contribution
A novel hybrid ETL method that integrates data and metadata incrementally, incorporating human-in-the-loop and selective data loading for scientific research applications.
Findings
Outperforms eager and lazy ETL in data sharing scenarios
Supports incremental and selective data integration
Enhances scientific research data management
Abstract
Scientific research requires access to, analysis of, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. The bootstrapping of this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, because the entire data is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration, as a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. This way,…
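The contrast the abstract draws between the three strategies can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the `Source` class, the `select` step standing in for the user-driven selection, the in-memory dictionaries, and the sample records from `demo_sources` are all hypothetical. The `reads` counter only makes visible how many payloads each strategy pulls from the sources.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical remote source mapping record id -> (metadata, payload)."""
    name: str
    records: dict
    reads: int = 0  # number of payload fetches, a stand-in for loading cost

    def list_ids(self):
        return list(self.records)

    def fetch_metadata(self, rid):
        return self.records[rid][0]

    def fetch_payload(self, rid):
        self.reads += 1
        return self.records[rid][1]


class EagerETL:
    """Loads all metadata and all payloads at bootstrap."""

    def __init__(self, sources):
        self.meta, self.data = {}, {}
        for s in sources:
            for rid in s.list_ids():
                self.meta[(s.name, rid)] = s.fetch_metadata(rid)
                self.data[(s.name, rid)] = s.fetch_payload(rid)

    def query(self, predicate):
        return [self.data[k] for k, m in self.meta.items() if predicate(m)]


class LazyETL:
    """Loads all metadata at bootstrap; payloads only when a query needs them."""

    def __init__(self, sources):
        self.sources = {s.name: s for s in sources}
        self.meta = {(s.name, rid): s.fetch_metadata(rid)
                     for s in sources for rid in s.list_ids()}
        self.data = {}

    def query(self, predicate):
        out = []
        for key, m in self.meta.items():
            if predicate(m):
                if key not in self.data:  # fetch once, then reuse
                    name, rid = key
                    self.data[key] = self.sources[name].fetch_payload(rid)
                out.append(self.data[key])
        return out


class HybridETL(LazyETL):
    """Assumed hybrid: nothing is loaded at bootstrap. Metadata is loaded
    incrementally for the records a user selects; payloads on query."""

    def __init__(self, sources):
        self.sources = {s.name: s for s in sources}
        self.meta, self.data = {}, {}

    def select(self, source_name, ids):
        # User-driven step (hypothetical): load metadata for chosen records only.
        s = self.sources[source_name]
        for rid in ids:
            if (source_name, rid) not in self.meta:
                self.meta[(source_name, rid)] = s.fetch_metadata(rid)


def demo_sources():
    """Made-up sample data for illustration only."""
    return [
        Source("siteA", {1: ({"modality": "MRI"}, "a1"),
                         2: ({"modality": "CT"}, "a2")}),
        Source("siteB", {1: ({"modality": "MRI"}, "b1")}),
    ]
```

In this sketch, `EagerETL(demo_sources())` fetches every payload before the first query, `LazyETL` fetches none until `query` is called, and `HybridETL` additionally restricts both metadata and payload loading to the records passed to `select`.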
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
