PipeGen: Data Pipe Generator for Hybrid Analytics
Brandon Haynes, Alvin Cheung, Magdalena Balazinska

TL;DR
PipeGen is a tool that automates high-performance data transfer between different database systems, significantly speeding up data movement for hybrid analytics workloads.
Contribution
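PipeGen introduces an automated method for generating efficient data pipes between DBMSs. The generated pipes use parallel export/import and binary transfer, and they avoid materializing the transmitted data on the file system, which improves performance over manual export and import.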
Findings
Achieves up to 3.8x speedup over manual CSV export/import.
Supports multiple DBMSs: the prototype automatically generates data pipes between five different DBMSs.
Demonstrates effectiveness in diverse system configurations.
Abstract
We develop a tool called PipeGen for efficient data transfer between database management systems (DBMSs). PipeGen targets data analytics workloads on shared-nothing engines. It supports scenarios where users seek to perform different parts of an analysis in different DBMSs or want to combine and analyze data stored in different systems. The systems may be colocated in the same cluster or may be in different clusters. To achieve high performance, PipeGen leverages the ability of all DBMSs to export, possibly in parallel, data into a common data format, such as CSV or JSON. It automatically extends these import and export functions with efficient binary data transfer capabilities that avoid materializing the transmitted data on the file system. We implement a prototype of PipeGen and evaluate it by automatically generating data pipes between five different DBMSs. Our experiments show that…
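The core idea in the abstract, replacing a file-based text export with a direct binary stream, can be illustrated with a small sketch. This is not PipeGen's code or its wire format: the fixed two-column schema, the `ROW` layout, and the function names `export_rows`, `import_rows`, and `transfer` are all hypothetical, and a local socket pair stands in for the network connection between two DBMSs.

```python
import socket
import struct
import threading

# Hypothetical fixed schema for illustration: (int64 id, float64 value).
ROW = struct.Struct("<qd")


def export_rows(sock, rows):
    """Exporter side: send packed binary rows over a socket instead of writing a CSV file."""
    with sock:
        for row_id, value in rows:
            sock.sendall(ROW.pack(row_id, value))


def import_rows(sock):
    """Importer side: read fixed-width binary rows until the exporter closes the socket."""
    rows = []
    buf = b""
    with sock:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            # Decode only whole rows; keep any partial row for the next chunk.
            usable = len(buf) - len(buf) % ROW.size
            if usable:
                rows.extend(ROW.iter_unpack(buf[:usable]))
                buf = buf[usable:]
    return rows


def transfer(rows):
    """Move rows from an exporter thread to an importer without touching the file system."""
    src, dst = socket.socketpair()
    exporter = threading.Thread(target=export_rows, args=(src, rows))
    exporter.start()
    received = import_rows(dst)
    exporter.join()
    return received
```

Calling `transfer([(1, 1.5), (2, -0.25)])` returns the same rows on the importer side. No text is formatted or parsed and nothing is written to disk, which are the two costs the abstract says the generated pipes avoid.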
Taxonomy
Topics: Advanced Data Storage Technologies · Advanced Database Systems and Queries · Cloud Computing and Resource Management
