Data Engineering for HPC with Python
Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey Fox

TL;DR
This paper introduces a distributed Python API for data engineering in HPC environments, utilizing C++ kernels and MPI for efficient processing of large datasets in scientific applications.
Contribution
It presents a high-performance, distributed data engineering framework built on a table abstraction, combining Python's ease of use with the speed of C++ kernels and the scalability of MPI.
Findings
Achieves efficient large dataset processing in HPC clusters.
Provides a Python API with C++ performance kernels.
Enables flexible data transformations for deep learning applications.
Abstract
Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory…
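The workflow the abstract describes, partitioned tables transformed into matrix form, can be illustrated with a minimal sketch. This is not the paper's API: every function name below is hypothetical, the tables are plain lists of dicts, and the "workers" are simulated in a single process with a loop where a real system would use MPI ranks and C++ kernels. It shows a hash-partitioned inner join followed by conversion of the joined table to a row-major numeric matrix.

```python
from collections import defaultdict


def hash_partition(rows, key, world_size):
    """Assign each row to a simulated worker by hashing its join key."""
    parts = [[] for _ in range(world_size)]
    for row in rows:
        parts[hash(row[key]) % world_size].append(row)
    return parts


def local_join(left, right, key):
    """Inner hash join of two row lists held by the same worker."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, world_size=4):
    """Partition both tables by key, then join each partition locally.

    Assumption: a plain loop stands in for parallel MPI ranks.
    """
    left_parts = hash_partition(left, key, world_size)
    right_parts = hash_partition(right, key, world_size)
    joined = []
    for rank in range(world_size):
        joined.extend(local_join(left_parts[rank], right_parts[rank], key))
    return joined


def to_matrix(rows, columns):
    """Convert selected table columns to a row-major list of float rows."""
    return [[float(row[c]) for c in columns] for row in rows]


# Hypothetical example data: feature rows and label rows sharing an "id" key.
samples = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}, {"id": 3, "x": 2.5}]
labels = [{"id": 1, "y": 0}, {"id": 3, "y": 1}]

joined = sorted(distributed_join(samples, labels, "id"), key=lambda r: r["id"])
matrix = to_matrix(joined, ["x", "y"])
```

Because rows with equal keys always hash to the same partition, each simulated worker can join its share without seeing the others' data; that property is what lets the same pattern scale across processes.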
