Accelerating Fresh Data Exploration with Fluid ETL Pipelines
Maxwell Norfolk, Dong Xie

TL;DR
This paper introduces fluid ETL pipelines that enable flexible, on-demand data preprocessing to accelerate fresh data exploration, leveraging idle resources and adapting to evolving user interests.
Contribution
The paper proposes fluid ETL pipelines that allow dynamic, resource-efficient data preprocessing, improving fresh data exploration without blocking ingestion or requiring extensive prior knowledge.
Findings
Viability is demonstrated on a real-world dataset
Fluid pipelines enable on-demand execution of data preprocessing routines (DPRs)
Adaptive DPR management improves exploration efficiency
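The idea behind these findings can be illustrated with a small toy model. The sketch below is not the authors' system: the class `FluidPipeline`, its methods `ingest`, `query`, and `run_idle`, and the notion of a per-call `budget` are all hypothetical names chosen for illustration. It assumes records arrive as JSON strings, that the only DPR is extracting one field into a lookup column, and that "user interest" is simply how often a field has been queried.

```python
import json
from collections import Counter


class FluidPipeline:
    """Toy model (hypothetical): ingestion never waits on preprocessing."""

    def __init__(self):
        self.raw = []              # ingested records, kept as raw JSON strings
        self.columns = {}          # field -> {record index: extracted value}
        self.interest = Counter()  # how often each field has been queried

    def ingest(self, line):
        # The ingestion path only appends; no parsing or indexing happens here.
        self.raw.append(line)

    def query(self, field, value):
        # Record interest so later idle-time work follows what users ask about.
        self.interest[field] += 1
        column = self.columns.get(field, {})
        hits = []
        for i, line in enumerate(self.raw):
            if i in column:
                got = column[i]                  # reuse preprocessed value
            else:
                got = json.loads(line).get(field)  # parse at query time
            if got == value:
                hits.append(i)
        return hits

    def run_idle(self, budget):
        # Spend at most `budget` records' worth of idle capacity on the
        # field that queries have touched most often so far.
        if not self.interest:
            return 0
        field = self.interest.most_common(1)[0][0]
        column = self.columns.setdefault(field, {})
        done = 0
        for i, line in enumerate(self.raw):
            if done >= budget:
                break
            if i not in column:
                column[i] = json.loads(line).get(field)
                done += 1
        return done
```

In this sketch a query always returns correct results whether or not preprocessing has run; calling `run_idle` only changes how much parsing a later query has to repeat. That separation is the design point being illustrated: preprocessing is optional work that can be scheduled when capacity happens to be available.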
Abstract
Recently, we have seen an increasing need for fresh data exploration, where data analysts seek to explore the main characteristics or detect anomalies of data being actively collected. In addition to the common challenges in classic data exploration, such as a lack of prior knowledge about the data or the analysis goal, fresh data exploration also demands an ingestion system with sufficient throughput to keep up with rapid data accumulation. However, leveraging traditional Extract-Transform-Load (ETL) pipelines to achieve low query latency can still be extremely resource-intensive as they must conduct an excessive amount of data preprocessing routines (DPRs) (e.g., parsing and indexing) to cover unpredictable data characteristics and analysis goals. To overcome this challenge, we seek to approach it from a different angle: leveraging occasional idle system capacity or cheap preemptive…
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Advanced Database Systems and Queries
